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Preface to the Second Edition 





Since the publication of the first edition of this book in 2008, significant 
developments have been made in metaheuristics, and new nature-inspired 
metaheuristic algorithms emerge, including cuckoo search and bat algo- 
rithms. Many readers have taken time to write to me personally, providing 
valuable feedback, asking for more details of algorithm implementation, 
or simply expressing interests in applying these new algorithms in their 
applications. 

In this revised edition, we strive to review the latest developments in 
metaheuristic algorithms, to incorporate readers’ suggestions, and to pro- 
vide a more detailed description to algorithms. Firstly, we have added 
detailed descriptions of how to incorporate constraints in the actual imple- 
mentation. Secondly, we have added three chapters on differential evolu- 
tion, cuckoo search and bat algorithms, while some existing chapters such 
as ant algorithms and bee algorithms are combined into one due to their 
similarity. Thirdly, we also explained artificial neural networks and sup- 
port vector machines in the framework of optimization and metaheuristics. 
Finally, we have been trying in this book to provide a consistent and uni- 
fied approach to metaheuristic algorithms, from a brief history in the first 
chapter to the unified approach in the last chapter. 

Furthermore, we have provided more Matlab programs. At the same 
time, we also omit some of the implementation such as genetic algorithms, 
as we know that there are many good software packages (both commercial 
and open course). This allows us to focus more on the implementation of 
new algorithms. Some of the programs also have a version for constrained 
optimization, and readers can modify them for their own applications. 

Even with the good intention to cover most popular metaheuristic al- 
gorithms, the choice of algorithms is a difficult task, as we do not have 
the space to cover every algorithm. The omission of an algorithm does not 
mean that it is not popular. In fact, some algorithms are very powerful 
and routinely used in many applications. Good examples are Tabu search 
and combinatorial algorithms, and interested readers can refer to the refer- 
ences provided at the end of the book. The effort in writing this little book 
becomes worth while if this book could in some way encourage readers’ 
interests in metaheuristics. 


Xin-She Yang 


August 2010 
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Preface to the First Edition 





Modern metaheuristic algorithms such as the ant colony optimization and 
the harmony search start to demonstrate their power in dealing with tough 
optimization problems and even NP-hard problems. This book reviews and 
introduces the state-of-the-art nature-inspired metaheuristic algorithms in 
optimization, including genetic algorithms (GA), particle swarm optimiza- 
tion (PSO), simulated annealing (SA), ant colony optimization (ACO), bee 
algorithms (BA), harmony search (HS), firefly algorithms (FA), photosyn- 
thetic algorithm (PA), enzyme algorithm (EA) and Tabu search. By imple- 
menting these algorithms in Matlab/Octave, we will use worked examples 
to show how each algorithm works. This book is thus an ideal textbook for 
an undergraduate and/or graduate course. As some of the algorithms such 
as the harmony search and firefly algorithms are at the forefront of current 
research, this book can also serve as a reference book for researchers. 

I would like to thank my editor, Andy Adamatzky, at Luniver Press for 
his help and professionalism. Last but not least, I thank my wife and son 
for their help. 


Xin-She Yang 


Cambridge, 2008 












Chapter 1 


INTRODUCTION 





It is no exaggeration to say that optimization is everywhere, from engi- 
neering design to business planning and from the routing of the Internet to 
holiday planning. In almost all these activities, we are trying to achieve cer- 
tain objectives or to optimize something such as profit, quality and time. 
As resources, time and money are always limited in real-world applica- 
tions, we have to find solutions to optimally use these valuable resources 
under various constraints. Mathematical optimization or programming is 
the study of such planning and design problems using mathematical tools. 
Nowadays, computer simulations become an indispensable tool for solving 
such optimization problems with various efficient search algorithms. 


1.1. OPTIMIZATION 


Mathematically speaking, it is possible to write most optimization problems 
in the generic form 


rere fi(x), (i = iL, 2, 5% M), (1.4) 
subject to hj(a)=0, (j =1,2,..., J), (1.2) 


where f;(a),h;(x) and g,(a) are functions of the design vector 
gw = (£1, 02, ...,2n)°. (1.4) 


Here the components x; of x are called design or decision variables, and 
they can be real continuous, discrete or the mixed of these two. 

The functions f;(a) where i = 1,2,...,.M are called the objective func- 
fons or simply cost functions, and in the case of M = 1, there is only a 
ete objective. The space spanned by the decision variables is called the 
space or search space #”, while the space formed by the objective 
igtion values is called the solution space or response space. The equali- 
tieS for hj; and inequalities for g, are called constraints. It is worth pointing 
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out that we can also write the inequalities in the other way > 0, and we 
can also formulate the objectives as a maximization problem. 

In a rare but extreme case where there is no objective at all, there are 
only constraints. Such a problem is called a feasibility problem because 
any feasible solution is an optimal solution. 

If we try to classify optimization problems according to the number 
of objectives, then there are two categories: single objective M = 1 and 
multiobjective M > 1. Multiobjective optimization is also referred to as 
multicriteria or even multi-attributes optimization in the literature. In 
real-world problems, most optimization tasks are multiobjective. Though 
the algorithms we will discuss in this book are equally applicable to mul- 
tiobjective optimization with some modifications, we will mainly place the 
emphasis on single objective optimization problems. 

Similarly, we can also classify optimization in terms of number of con- 
straints J+ K. If there is no constraint at all J = K = 0, then it is 
called an unconstrained optimization problem. If K = 0 and J > 1, it is 
called an equality-constrained problem, while J = 0 and K > 1 becomes 
an inequality-constrained problem. It is worth pointing out that in some 
formulations in the optimization literature, equalities are not explicitly in- 
cluded, and only inequalities are included. This is because an equality 
can be written as two inequalities. For example h(x) = 0 is equivalent to 
h(a) <0 and h(a) > 0. 

We can also use the actual function forms for classification. The objec- 
tive functions can be either linear or nonlinear. If the constraints h; and gz 
are all linear, then it becomes a linearly constrained problem. If both the 
constraints and the objective functions are all linear, it becomes a linear 
programming problem. Here ‘programming’ has nothing to do with com- 
puting programming, it means planning and/or optimization. However, 
generally speaking, all f;,h; and g; are nonlinear, we have to deal with a 
nonlinear optimization problem. 


1.2. SEARCH FOR OPTIMALITY 


After an optimization problem is formulated correctly, the main task is 
to find the optimal solutions by some solution procedure using the right 
mathematical techniques. 

Figuratively speaking, searching for the optimal solution is like treasure 
hunting. Imagine we are trying to hunt for a hidden treasure in a hilly 
landscape within a time limit. In one extreme, suppose we are blind- 

hers fold without any guidance, the search process is essentially a pure random 
search, which is usually not efficient as we can expect. In another extreme, 
wwe are told the treasure is placed at the highest peak of a known region, 
é will then directly climb up to the steepest cliff and try to reach to the 
Whighest peak, and this scenario corresponds to the classical hill-climbing 
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techniques. In most cases, our search is between these extremes. We are 
not blind-fold, and we do not know where to look for. It is a silly idea to 
search every single square inch of an extremely large hilly region so as to 
find the treasure. 

The most likely scenario is that we will do a random walk, while looking 
for some hints; we look at some place almost randomly, then move to an- 
other plausible place, then another and so on. Such random walk is a main 
characteristic of modern search algorithms. Obviously, we can either do 
the treasure-hunting alone, so the whole path is a trajectory-based search, 
and simulated annealing is such a kind. Alternatively, we can ask a group 
of people to do the hunting and share the information (and any treasure 
found), and this scenario uses the so-called swarm intelligence and corre- 
sponds to the particle swarm optimization, as we will discuss later in detail. 
If the treasure is really important and if the area is extremely large, the 
search process will take a very long time. If there is no time limit and if any 
region is accessible (for example, no islands in a lake), it is theoretically 
possible to find the ultimate treasure (the global optimal solution). 

Obviously, we can refine our search strategy a little bit further. Some 
hunters are better than others. We can only keep the better hunters and 
recruit new ones, this is something similar to the genetic algorithms or 
evolutionary algorithms where the search agents are improving. In fact, as 
we will see in almost all modern metaheuristic algorithms, we try to use the 
best solutions or agents, and randomize (or replace) the not-so-good ones, 
while evaluating each individual’s competence (fitness) in combination with 
the system history (use of memory). With such a balance, we intend to 
design better and efficient optimization algorithms. 

Classification of optimization algorithm can be carried out in many ways. 
A simple way is to look at the nature of the algorithm, and this divides the 
algorithms into two categories: deterministic algorithms, and stochastic 
algorithms. Deterministic algorithms follow a rigorous procedure, and its 
path and values of both design variables and the functions are repeatable. 
For example, hill-climbing is a deterministic algorithm, and for the same 
starting point, they will follow the same path whether you run the program 
today or tomorrow. On the other hand, stochastic algorithms always have 
some randomness. Genetic algorithms are a good example, the strings or 
solutions in the population will be different each time you run a program 
since the algorithms use some pseudo-random numbers, though the final 
results may be no big difference, but the paths of each individual are not 
exactly repeatable. 

gb Metter, Furthermore, there is a third type of algorithm which is a mixture, or 
x iybrid, of deterministic and stochastic algorithms. For example, hill- 
ibing with a random restart is a good example. The basic idea is to 
he deterministic algorithm, but start with different initial points. This 
certain advantages over a simple hill-climbing technique, which may be 
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stuck in a local peak. However, since there is a random component in this 
hybrid algorithm, we often classify it as a type of stochastic algorithm in 
the optimization literature. 


1.3. NATURE-INSPIRED METAHEURISTICS 


Most conventional or classic algorithms are deterministic. For example, the 
simplex method in linear programming is deterministic. Some determinis- 
tic optimization algorithms used the gradient information, they are called 
gradient-based algorithms. For example, the well-known Newton-Raphson 
algorithm is gradient-based, as it uses the function values and their deriva- 
tives, and it works extremely well for smooth unimodal problems. However, 
if there is some discontinuity in the objective function, it does not work 
well. In this case, a non-gradient algorithm is preferred. Non-gradient- 
based or gradient-free algorithms do not use any derivative, but only the 
function values. Hooke-Jeeves pattern search and Nelder-Mead downhill 
simplex are examples of gradient-free algorithms. 

For stochastic algorithms, in general we have two types: heuristic and 
metaheuristic, though their difference is small. Loosely speaking, heuristic 
means ‘to find’ or ‘to discover by trial and error’. Quality solutions to a 
tough optimization problem can be found in a reasonable amount of time, 
but there is no guarantee that optimal solutions are reached. It hopes 
that these algorithms work most of the time, but not all the time. This is 
good when we do not necessarily want the best solutions but rather good 
solutions which are easily reachable. 

Further development over the heuristic algorithms is the so-called meta- 
heuristic algorithms. Here meta- means ‘beyond’ or ‘higher level’, and 
they generally perform better than simple heuristics. In addition, all meta- 
heuristic algorithms use certain tradeoff of randomization and local search. 
It is worth pointing out that no agreed definitions of heuristics and meta- 
heuristics exist in the literature; some use ‘heuristics’ and ‘metaheuristics’ 
interchangeably. However, the recent trend tends to name all stochastic 
algorithms with randomization and local search as metaheuristic. Here we 
will also use this convention. Randomization provides a good way to move 
away from local search to the search on the global scale. Therefore, almost 
all metaheuristic algorithms intend to be suitable for global optimization. 

Heuristics is a way by trial and error to produce acceptable solutions to 
a complex problem in a reasonably practical time. The complexity of the 
problem of interest makes it impossible to search every possible solution 

“gr combination, the aim is to find good feasible solution in an acceptable 
‘amescale. There is no guarantee that the best solutions can be found, and 
@ even do not know whether an algorithm will work and why if it does 
ork. The idea is to have an efficient but practical algorithm that will 
eWwork most the time and is able to produce good quality solutions. Among 
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the found quality solutions, it is expected some of them are nearly optimal, 
though there is no guarantee for such optimality. 

Two major components of any metaheuristic algorithms are: intensifi- 
cation and diversification, or exploitation and exploration. Diversification 
means to generate diverse solutions so as to explore the search space on the 
global scale, while intensification means to focus on the search in a local 
region by exploiting the information that a current good solution is found 
in this region. This is in combination with with the selection of the best 
solutions. The selection of the best ensures that the solutions will converge 
to the optimality, while the diversification via randomization avoids the 
solutions being trapped at local optima and, at the same time, increases 
the diversity of the solutions. The good combination of these two major 
components will usually ensure that the global optimality is achievable. 

Metaheuristic algorithms can be classified in many ways. One way is 
to classify them as: population-based and trajectory-based. For example, 
genetic algorithms are population-based as they use a set of strings, so 
is the particle swarm optimization (PSO) which uses multiple agents or 
particles. 

On the other hand, simulated annealing uses a single agent or solution 
which moves through the design space or search space in a piecewise style. 
A better move or solution is always accepted, while a not-so-good move 
can be accepted with a certain probability. The steps or moves trace a tra- 
jectory in the search space, with a non-zero probability that this trajectory 
can reach the global optimum. 

Before we introduce all popular meteheuristic algorithms in detail, let 
us look at their history briefly. 


1.4 A BRIEF HISTORY OF METAHEURISTICS 


Throughout history, especially at the early periods of human history, we 
humans’ approach to problem-solving has always been heuristic or meta- 
heuristic — by trial and error. Many important discoveries were done 
by ‘thinking outside the box’, and often by accident; that is heuristics. 
Archimedes’s Eureka moment was a heuristic triumph. In fact, our daily 
learning experience (at least as a child) is dominantly heuristic. 

Despite its ubiquitous nature, metaheuristics as a scientific method to 
problem solving is indeed a modern phenomenon, though it is difficult to 
pinpoint when the metaheuristic method was first used. Alan Turing was 









n he was breaking German Enigma ciphers at Bletchley Park. Turing 
cd his search method heuristic search, as it could be expected it worked 


tit was a tremendous success. In 1945, Turing was recruited to the 
ational Physical Laboratory (NPL), UK where he set out his design for 
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the Automatic Computing Engine (ACE). In an NPL report on Intelligent 
machinery in 1948, he outlined his innovative ideas of machine intelligence 
and learning, neural networks and evolutionary algorithms. 

The 1960s and 1970s were the two important decades for the develop- 
ment of evolutionary algorithms. First, John Holland and his collaborators 
at the University of Michigan developed the genetic algorithms in 1960s 
and 1970s. As early as 1962, Holland studied the adaptive system and was 
the first to use crossover and recombination manipulations for modeling 
such system. His seminal book summarizing the development of genetic 
algorithms was published in 1975. In the same year, De Jong finished his 
important dissertation showing the potential and power of genetic algo- 
rithms for a wide range of objective functions, either noisy, multimodal or 
even discontinuous. 

In essence, a genetic algorithm (GA) is a search method based on the ab- 
straction of Darwinian evolution and natural selection of biological systems 
and representing them in the mathematical operators: crossover or recom- 
bination, mutation, fitness, and selection of the fittest. Ever since, genetic 
algorithms become so successful in solving a wide range of optimization 
problems, there have several thousands of research articles and hundreds 
of books written. Some statistics show that a vast majority of Fortune 
500 companies are now using them routinely to solve tough combinatorial 
optimization problems such as planning, data-fitting, and scheduling. 

During the same period, Ingo Rechenberg and Hans-Paul Schwefel both 
then at the Technical University of Berlin developed a search technique for 
solving optimization problem in aerospace engineering, called evolutionary 
strategy, in 1963. Later, Peter Bienert joined them and began to construct 
an automatic experimenter using simple rules of mutation and selection. 
There was no crossover in this technique, only mutation was used to pro- 
duce an offspring and an improved solution was kept at each generation. 
This was essentially a simple trajectory-style hill-climbing algorithm with 
randomization. As early as 1960, Lawrence J. Fogel intended to use simu- 
lated evolution as a learning process as a tool to study artificial intelligence. 
Then, in 1966, L. J. Fogel, together A. J. Owen and M. J. Walsh, developed 
the evolutionary programming technique by representing solutions as finite- 
state machines and randomly mutating one of these machines. The above 
innovative ideas and methods have evolved into a much wider discipline, 
called evolutionary algorithms and/or evolutionary computation. 

Although our focus in this book is metaheuristic algorithms, other al- 
gorithms can be thought as a heuristic optimization technique. These in- 

gh Meth cludes artificial neural networks, support vector machines and many other 
ay “machine learning techniques. Indeed, they all intend to minimize their 
earning errors and prediction (capability) errors via iterative trials and 
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Artificial neural networks are now routinely used in many applications. 
In 1943, W. McCulloch and W. Pitts proposed the artificial neurons as 
simple information processing units. The concept of a neural network was 
probably first proposed by Alan Turing in his 1948 NPL report concerning 
‘intelligent machinery’. Significant developments were carried out from the 
1940s and 1950s to the 1990s with more than 60 years of history. 

The support vector machine as a classification technique can date back to 
the earlier work by V. Vapnik in 1963 on linear classifiers, and the nonlinear 
classification with kernel techniques were developed by V. Vapnik and his 
collaborators in the 1990s. A systematical summary in Vapnik’s book on 
the Nature of Statistical Learning Theory was published in 1995. 

The two decades of 1980s and 1990s were the most exciting time for 
metaheuristic algorithms. The next big step is the development of simu- 
lated annealing (SA) in 1983, an optimization technique, pioneered by S. 
Kirkpatrick, C. D. Gellat and M. P. Vecchi, inspired by the annealing pro- 
cess of metals. It is a trajectory-based search algorithm starting with an 
initial guess solution at a high temperature, and gradually cooling down 
the system. A move or new solution is accepted if it is better; otherwise, 
it is accepted with a probability, which makes it possible for the system to 
escape any local optima. It is then expected that if the system is cooled 
down slowly enough, the global optimal solution can be reached. 

The actual first usage of memory in modern metaheuristics is probably 
due to Fred Glover’s Tabu search in 1986, though his seminal book on Tabu 
search was published later in 1997. 

In 1992, Marco Dorigo finished his PhD thesis on optimization and nat- 
ural algorithms, in which he described his innovative work on ant colony 
optimization (ACO). This search technique was inspired by the swarm in- 
telligence of social ants using pheromone as a chemical messenger. Then, in 
1992, John R. Koza of Stanford University published a treatise on genetic 
programming which laid the foundation of a whole new area of machine 
learning, revolutionizing computer programming. As early as in 1988, Koza 
applied his first patent on genetic programming. The basic idea is to use the 
genetic principle to breed computer programs so as to gradually produce 
the best programs for a given type of problem. 

Slightly later in 1995, another significant progress is the development 
of the particle swarm optimization (PSO) by American social psychologist 
James Kennedy, and engineer Russell C. Eberhart. Loosely speaking, PSO 
is an optimization algorithm inspired by swarm intelligence of fish and birds 
and by even human behavior. The multiple agents, called particles, swarm 

oneenaground the search space starting from some initial random guess. The 
ay fag communicates the current best and shares the global best so as to 
ig on the quality solutions. Since its development, there have been about 
ifferent variants of particle swarm optimization techniques, and have 
*1 applied to almost all areas of tough optimization problems. There is 
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some strong evidence that PSO is better than traditional search algorithms 
and even better than genetic algorithms for many types of problems, though 
this is far from conclusive. 

In around 1996 and later in 1997, R. Storn and K. Price developed their 
vector-based evolutionary algorithm, called differential evolution (DE), and 
this algorithm proves more efficient than genetic algorithms in many ap- 
plications. 

In 1997, the publication of the ‘no free lunch theorems for optimization’ 
by D. H. Wolpert and W. G. Macready sent out a shock way to the opti- 
mization community. Researchers have been always trying to find better 
algorithms, or even universally robust algorithms, for optimization, espe- 
cially for tough NP-hard optimization problems. However, these theorems 
state that if algorithm A performs better than algorithm B for some opti- 
mization functions, then B will outperform A for other functions. That is 
to say, if averaged over all possible function space, both algorithms A and B 
will perform on average equally well. Alternatively, there is no universally 
better algorithms exist. That is disappointing, right? Then, people real- 
ized that we do not need the average over all possible functions for a given 
optimization problem. What we want is to find the best solutions, which 
has nothing to do with average over all possible function space. In addition, 
we can accept the fact that there is no universal or magical tool, but we do 
know from our experience that some algorithms indeed outperform others 
for given types of optimization problems. So the research now focuses on 
finding the best and most efficient algorithm(s) for a given problem. The 
objective is to design better algorithms for most types of problems, not for 
all the problems. Therefore, the search is still on. 

At the turn of the 21st century, things became even more exciting. First, 
Zong Woo Geem et al. in 2001 developed the harmony search (HS) algo- 
rithm, which has been widely applied in solving various optimization prob- 
lems such as water distribution, transport modelling and scheduling. In 
2004, S. Nakrani and C. Tovey proposed the honey bee algorithm and its 
application for optimizing Internet hosting centers, which followed by the 
development of a novel bee algorithm by D. T. Pham et al. in 2005 and the 
artificial bee colony (ABC) by D. Karaboga in 2005. In 2008, the author of 
this book developed the firefly algorithm (FA)'. Quite a few research arti- 
cles on the firefly algorithm then followed, and this algorithm has attracted 
a wide range of interests. In 2009, Xin-She Yang at Cambridge University, 
UK, and Suash Deb at Raman College of Engineering, India, introduced 
an efficient cuckoo search (CS) algorithm, and it has been demonstrated 
that CS is far more effective than most existing metaheuristic algorithms 


& 
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including particle swarm optimization”. In 2010, the author of this book 
developed a bat-inspired algorithm for continuous optimization, and its 
efficiency is quite promising. 

As we can see, more and more metaheuristic algorithms are being devel- 
oped. Such a diverse range of algorithms necessitates a systematic summary 
of various metaheuristic algorithms, and this book is such an attempt to 
introduce all the latest nature-inspired metaheuristics with diverse appli- 
cations. 

We will discuss all major modern metaheuristic algorithms in the rest 
of this book, including simulated annealing (SA), genetic algorithms (GA), 
ant colony optimization (ACO), bee algorithms (BA), differential evolution 
(DE), particle swarm optimization (PSO), harmony search (HS), the firefly 
algorithm (FA), cuckoo search (CS) and bat-inspired algorithm (BA), and 
others. 
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Chapter 2 


RANDOM WALKS AND LEVY FLIGHTS 





From the brief analysis of the main characteristics of metaheuristic algo- 
rithms in the first chapter, we know that randomization plays an important 
role in both exploration and exploitation, or diversification and intensifi- 
cation. The essence of such randomization is the random walk. In this 
chapter, we will briefly review the fundamentals of random walks, Lévy 
flights and Markov chains. These concepts may provide some hints and 
insights into how and why metaheuristic algorithms behave. 


2.1 RANDOM VARIABLES 


Loosely speaking, a random variable can be considered as an expression 
whose value is the realization or outcome of events associated with a ran- 
dom process such as the noise level on the street. The values of random 
variables are real, though for some variables such as the number of cars on 
a road can only take discrete values, and such random variables are called 
discrete random variables. If a random variable such as noise at a particular 
location can take any real values in an interval, it is called continuous. If a 
random variable can have both continuous and discrete values, it is called 
a mixed type. Mathematically speaking, a random variable is a function 
which maps events to real numbers. The domain of this mapping is called 
the sample space. 

For each random variable, a probability density function can be used 
to express its probability distribution. For example, the number of phone 
calls per minute, and the number of users of a web server per day all obey 
the Poisson distribution 


Nre7A 


n! 





p(n; A) = ; (n = 0,1, 2,...), (2.1) 
re \ > 0 is a parameter which is the mean or expectation of the occur- 
of the event during a unit interval. 

ifferent random variables will have different distributions. Gaussian 
‘ibution or normal distribution is by far the most popular distribu- 
_{i0ns, because many physical variables including light intensity, and er- 
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rors/uncertainty in measurements, and many other processes obey the nor- 
mal distribution 


1 (x — p)? 

2 
XL; 07) = ex , 
P(2; 0°) = —T— exp[-S] 





oo <2 < 00, (2.2) 


where p is the mean and o > 0 is the standard deviation. This normal 
distribution is often denoted by N(,07). In the special case when ps = 0 
and o = 1, it is called a standard normal distribution, denoted by N(0, 1). 

In the context of metaheuristics, another important distribution is the 
so-called Lévy distribution, which is a distribution of the sum of N identi- 
cally and independently distribution random variables whose Fourier trans- 
form takes the following form 


Fy(k) = exp[-NIAI7I. (2.3) 


The inverse to get the actual distribution L(s) is not straightforward, as 
the integral 


co 
L(s) = -{ cos(Ts)e 7 dr, (0< 6 <2), (2.4) 
0 

does not have analytical forms, except for a few special cases. Here L(s) 
is called the Lévy distribution with an index @. For most applications, we 
can set a = 1 for simplicity. Two special cases are G = 1 and 9 = 2. When 
GB = 1, the above integral becomes the Cauchy distribution. When ( = 2, 
it becomes the normal distribution. In this case, Lévy flights become the 
standard Brownian motion. 

Mathematically speaking, we can express the integral (2.4) as an asymp- 
totic series, and its leading-order approximation for the flight length results 
in a power-law distribution 


L(s) ~ |s\-?~*, (25) 


which is heavy-tailed. The variance of such a power-law distribution is 
infinite for 0 < @ < 2. The moments diverge (or are infinite) for 0 < 6 < 2, 
which is a stumbling block for mathematical analysis. 


2.2 RANDOM WALKS 


A random walk is a random process which consists of taking a series of 


N 
Sy =) > Xi=Xit.. + Xn, (2.6) 


i=1 
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where X; is a random step drawn from a random distribution. This rela- 
tionship can also be written as a recursive formula 


N-1 
Sv = 5) +Xn =Sy-1 + Xn, (2.7) 


i=1 


which means the next state Sj will only depend the current existing state 
Sy_ 1 and the motion or transition Xy from the existing state to the next 
state. This is typically the main property of a Markov chain to be intro- 
duced later. 

Here the step size or length in a random walk can be fixed or varying. 
Random walks have many applications in physics, economics, statistics, 
computer sciences, environmental science and engineering. 

Consider a scenario, a drunkard walks on a street, at each step, he 
can randomly go forward or backward, this forms a random walk in one- 
dimensional. If this drunkard walks on a football pitch, he can walk in 
any direction randomly, this becomes a 2D random walk. Mathematically 
speaking, a random walk is given by the following equation 


Stat = Si + Wt, (2.8) 


where 5S; is the current location or state at t, and w; is a step or random 
variable with a known distribution. 

If each step or jump is carried out in the n-dimensional space, the ran- 
dom walk discussed earlier 


N 
Sv=S >, (2.9) 
w=1 


becomes a random walk in higher dimensions. In addition, there is no 
reason why each step length should be fixed. In fact, the step size can 
also vary according to a known distribution. If the step length obeys the 
Gaussian distribution, the random walk becomes the Brownian motion (see 
Fig. 2.1). 

In theory, as the number of steps N increases, the central limit theorem 
implies that the random walk (2.9) should approaches a Gaussian distribu- 
tion. As the mean of particle locations shown in Fig. 2.1 is obviously zero, 
their variance will increase linearly with t. In general, in the d-dimensional 
space, the variance of Brownian random walks can be written as 


a? (t) = |vo|?t? + (2dD)t, (2.10) 







1 vo is the drift velocity of the system. Here D = s?/(2r) is the 
ogtive diffusion coefficient which is related to the step length s over a 
rt time interval 7 during each jump. 
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Figure 2.1: Brownian motion in 2D: random walk with a Gaus- 
sian step-size distribution and the path of 50 steps starting at 
the origin (0,0) (marked with e). 


Therefore, the Brownian motion B(t) essentially obeys a Gaussian dis- 
tribution with zero mean and time-dependent variance. That is, B(t) ~ 
N(0,07(t)) where ~ means the random variable obeys the distribution on 
the right-hand side; that is, samples should be drawn from the distribution. 
A diffusion process can be viewed as a series of Brownian motion, and the 
motion obeys the Gaussian distribution. For this reason, standard diffusion 
is often referred to as the Gaussian diffusion. If the motion at each step is 
not Gaussian, then the diffusion is called non-Gaussian diffusion. 

If the step length obeys other distribution, we have to deal with more 
generalized random walk. A very special case is when the step length obeys 
the Lévy distribution, such a random walk is called a Lévy flight or Lévy 
walk. 


2.3. LEVY DISTRIBUTION AND LEVY FLIGHTS 


Broadly speaking, Lévy flights are a random walk whose step length is 
drawn from the Lévy distribution, often in terms of a simple power-law 
formula L(s) ~ |s|~!~° where 0 < 6 < 2 is an index. Mathematically 
speaking, a simple version of Lévy distribution can be defined as 





ar expl—a] epee 0<p<s<oo 
1 (6,754) = (2.11) 
0 otherwise, 
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Figure 2.2: Lévy flights in consecutive 50 steps starting at the 
origin (0,0) (marked with e). 


In general, Lévy distribution should be defined in terms of Fourier trans- 
form 


F(k) =exp[-alk\7], 0< 6 <2, (2.13) 


where a is a scale parameter. The inverse of this integral is not easy, as it 
does not have analytical form, except for a few special cases. 
For the case of @ = 2, we have 


F(k) = exp[-ak’], (2.14) 


whose inverse Fourier transform corresponds to a Gaussian distribution. 
Another special case is 3 = 1, and we have 


F'(k) = exp[—alk]], (2.15) 


which corresponds to a Cauchy distribution 


1 Y 
D2) fh ee (2.16) 
r+ (e—np 
where yp is the location parameter, while y controls the scale of this distri- 
bution. 
For the general case, the inverse integral 


L(s) = 5 [ cos(ks) exp[—alk|?]dk, (2.17) 
0 


t TT 
Fe ies sj 





| a BI(S)sin(n 8/2) 


L(s) m5) -8 ; 8 0. (2.18) 
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Here ['(z) is the Gamma function 


re= i Poe de, (2.19) 


In the case when z = n is an integer, we have ['(n) = (n — 1)!. 

Lévy flights are more efficient than Brownian random walks in exploring 
unknown, large-scale search space. There are many reasons to explain this 
efficiency, and one of them is due to the fact that the variance of Lévy 
flights 

2 3-8 
a (t)~t?, 1<6<2, (2.20) 


increases much faster than the linear relationship (i.e., 0?(t) ~ t) of Brow- 
nian random walks. 

Fig. 2.2 shows the path of Lévy flights of 50 steps starting from (0,0) 
with @ = 1. It is worth pointing out that a power-law distribution is often 
linked to some scale-free characteristics, and Lévy flights can thus show 
self-similarity and fractal behavior in the flight patterns. 

From the implementation point of view, the generation of random num- 
bers with Lévy flights consists of two steps: the choice of a random direction 
and the generation of steps which obey the chosen Lévy distribution. The 
generation of a direction should be drawn from a uniform distribution, while 
the generation of steps is quite tricky. There are a few ways of achieving 
this, but one of the most efficient and yet straightforward ways is to use 
the so-called Mantegna algorithm for a symmetric Lévy stable distribution. 
Here ‘symmetric’ means that the steps can be positive and negative. 

A random variable U and its probability distribution can be called stable 
if a linear combination of its two identical copies (or U; and U2) obeys the 
same distribution. That is, aU; + bU2 has the same distribution as cU + d 
where a,b > 0 andc,dE ®. If d= 0, it is called strictly stable. Gaussian, 
Cauchy and Lévy distributions are all stable distributions. 

In Mantegna’s algorithm, the step length s can be calculated by 


U 





s= pee” (2.21) 
where u and v are drawn from normal distributions. That is 
u~ N(0,02), v~ N(0,02), (2.22) 
tah — T(1+ 8)sin(7B/2) 1/6 
 aletahen = {na a aaa } 5 eye (2.23) 







is distribution (for s) obeys the expected Lévy distribution for |s| > |so| 
where so is the smallest step. In principle, |s9| >> 0, but in reality so can 
be taken as a sensible value such as sg = 0.1 to 1. 
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Studies show that Lévy flights can maximize the efficiency of resource 
searches in uncertain environments. In fact, Lévy flights have been observed 
among foraging patterns of albatrosses and fruit flies, and spider monkeys. 
Even humans such as the Ju/’hoansi hunter-gatherers can trace paths of 
Lévy-flight patterns. In addition, Lévy flights have many applications. 
Many physical phenomena such as the diffusion of fluorescent molecules, 
cooling behavior and noise could show Lévy-flight characteristics under the 
right conditions. 


2.4 OPTIMIZATION AS MARKOV CHAINS 


In every aspect, a simple random walk we discussed earlier can be consid- 
ered as a Markov chain. Briefly speaking, a random variable ¢ is a Markov 
process if the transition probability, from state ¢; = S; at time t to another 
state ¢,41 = S;, depends only on the current state ¢;. That is 


PUs9) = PGs = S5|Co = S53 voy Gt — Si) 


= P(G41 = S5|C = Si), (2.24) 


which is independent of the states before t. In addition, the sequence of ran- 
dom variables (Co, C1, ..-; Gn) generated by a Markov process is subsequently 
called a Markov chain. The transition probability P(i, 7) = P(t 7) = Pi; 
is also referred to as the transition kernel of the Markov chain. 

If we rewrite the random walk relationship (2.7) with a random move 
governed by w; which depends on the transition probability P, we have 


Sta = Si + Wt, (2.25) 


which indeed has the properties of a Markov chain. Therefore, a random 
walk is a Markov chain. 

In order to solve an optimization problem, we can search the solution by 
performing a random walk starting from a good initial but random guess 
solution. However, simple or blind random walks are not efficient. To be 
computationally efficient and effective in searching for new solutions, we 
have to keep the best solutions found so far, and to increase the mobility of 
the random walk so as to explore the search space more effectively. Most 
importantly, we have to find a way to control the walk in such a way that it 
can move towards the optimal solutions more quickly, rather than wander 


rther research along the route of Markov chains is that the devel- 
ént of the Markov chain Monte Carlo (MCMC) method, which is a 
; of sample-generating methods. It attempts to directly draw samples 
_frém some highly complex multi-dimensional distribution using a Markov 
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chain with known transition probability. Since the 1990s, the Markov chain 
Monte Carlo has become a powerful tool for Bayesian statistical analysis, 
Monte Carlo simulations, and potentially optimization with high nonlin- 
earity. 

An important link between MCMC and optimization is that some heuris- 
tic and metaheuristic search algorithms such as simulated annealing to be 
introduced later use a trajectory-based approach. They start with some ini- 
tial (random) state, and propose a new state (solution) randomly. Then, 
the move is accepted or not, depending on some probability. There is 
strongly similar to a Markov chain. In fact, the standard simulated an- 
nealing is a random walk. 

Mathematically speaking, a great leap in understanding metaheuristic 
algorithms is to view a Markov chain Monte carlo as an optimization pro- 
cedure. If we want to find the minimum of an objective function f(@) at 
6 = 6, so that f, = f(0.) < f(0), we can convert it to a target distribution 
for a Markov chain 

a(8) = e PFO), (2.26) 


where 9 > O is a parameter which acts as a normalized factor. ( value 
should be chosen so that the probability is close to 1 when 0 > 6,. At 
6 = 0,, 7(@) should reach a maximum 7, = 7(0.) > 7(@). This requires 
that the formulation of L(@) should be non-negative, which means that 
some objective functions can be shifted by a large constant A > 0 such as 
f —f +A if necessary. 

By constructing a Markov chain Monte Carlo, we can formulate a generic 
framework as outlined by Ghate and Smith in 2008, as shown in Figure 2.3. 
In this framework, simulated annealing and its many variants are simply a 
special case with 


exp[—7¢] if fir > fe 


x 
| 


1 if fis < fe 


In this case, only the difference Af between the function values is impor- 
tant. 

Algorithms such as simulated annealing, to be discussed in the next 
chapter, use a single Markov chain, which may not be very efficient. In 
practice, it is usually advantageous to use multiple Markov chains in paral- 
lel to increase the overall efficiency. In fact, the algorithms such as particle 
swarm optimization can be viewed as multiple interacting Markov chains, 
though such theoretical analysis remains almost intractable. The theory of 

panteracting Markov chains is complicated and yet still under development, 
y@wever, any progress in such areas will play a central role in the under- 
winding how population- and trajectory-based metaheuristic algorithms 
rform under various conditions. However, even though we do not fully 
“understand why metaheuristic algorithms work, this does not hinder us to 
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Markov Chain Algorithm for Optimization 


Start with Go € S, att =0 
while (criterion) 
Propose a new solution Y:41; 
Generate a random number 0 < P, <1; 





¢ oe with probability P, 
t+H1= 


Ct with probability 1 — Py (227) 


end 





Figure 2.3: Optimization as a Markov chain. 


use these algorithms efficiently. On the contrary, such mysteries can drive 
and motivate us to pursue further research and development in metaheuris- 
tics. 
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Chapter 3 


SIMULATED ANNEALING 





One of the earliest and yet most popular metaheuristic algorithms is simu- 
lated annealing (SA), which is a trajectory-based, random search technique 
for global optimization. It mimics the annealing process in material pro- 
cessing when a metal cools and freezes into a crystalline state with the 
minimum energy and larger crystal size so as to reduce the defects in 
metallic structures. The annealing process involves the careful control of 
temperature and its cooling rate, often called annealing schedule. 


3.1 ANNEALING AND BOLTZMANN DISTRIBUTION 


Since the first development of simulated annealing by Kirkpatrick, Gelatt 
and Vecchi in 1983, SA has been applied in almost every area of optimiza- 
tion. Unlike the gradient-based methods and other deterministic search 
methods which have the disadvantage of being trapped into local minima, 
the main advantage of simulated annealing is its ability to avoid being 
trapped in local minima. In fact, it has been proved that simulated an- 
nealing will converge to its global optimality if enough randomness is used 
in combination with very slow cooling. Essentially, simulated annealing is 
a search algorithm via a Markov chain, which converges under appropriate 
conditions. 

Metaphorically speaking, this is equivalent to dropping some bouncing 
balls over a landscape, and as the balls bounce and lose energy, they settle 
down to some local minima. If the balls are allowed to bounce enough times 
and lose energy slowly enough, some of the balls will eventually fall into 
the globally lowest locations, hence the global minimum will be reached. 

The basic idea of the simulated annealing algorithm is to use random 
search in terms of a Markov chain, which not only accepts changes that 

Riser. prove the objective function, but also keeps some changes that are not 
at, In a minimization problem, for example, any better moves or changes 
lecrease the value of the objective function f will be accepted; how- 
, some changes that increase f will also be accepted with a probability 
This probability p, also called the transition probability, is determined 
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by 
_ AE 
p=e *st, (3.1) 
where kp is the Boltzmann’s constant, and for simplicity, we can use k to 
denote kp because k = 1 is often used. T is the temperature for controlling 
the annealing process. AF is the change of the energy level. This transition 
probability is based on the Boltzmann distribution in statistical mechanics. 


The simplest way to link AF with the change of the objective function 
Af is to use 


AF = Af, (3.2) 


where ¥ is a real constant. For simplicity without losing generality, we can 
use kg = 1 and y = 1. Thus, the probability p simply becomes 


p(Af,T) =e". (3.3) 


Whether or not we accept a change, we usually use a random number r as 
a threshold. Thus, if p > r, or 


p=e T >T, (3.4) 


the move is accepted. 


3.2. PARAMETERS 


Here the choice of the right initial temperature is crucially important. For 
a given change Af, if T is too high (T’ — co), then p — 1, which means 
almost all the changes will be accepted. If T is too low (T — 0), then any 
Af > 0 (worse solution) will rarely be accepted as p — 0 and thus the 
diversity of the solution is limited, but any improvement Af will almost 
always be accepted. In fact, the special case JT’ — O corresponds to the 
gradient-based method because only better solutions are accepted, and the 
system is essentially climbing up or descending along a hill. Therefore, 
if T is too high, the system is at a high energy state on the topological 
landscape, and the minima are not easily reached. If T is too low, the 
system may be trapped in a local minimum (not necessarily the global 
minimum), and there is not enough energy for the system to jump out the 
wan minimum to explore other minima eae the global minimum. So 
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Two commonly used annealing schedules (or cooling schedules) are: lin- 
ear and geometric. For a linear cooling schedule, we have 


T = Tp — ft, (3.5) 


or T — T — 6T, where To is the initial temperature, and ¢ is the pseudo 
time for iterations. 9 is the cooling rate, and it should be chosen in such a 
way that T — 0 when t — ty (or the maximum number NV of iterations), 
this usually gives G = (To — Ty)/ty. 

On the other hand, a geometric cooling schedule essentially decreases 
the temperature by a cooling factor 0 < a < 1 so that T is replaced by aT’ 
or 

T(t) = Tha’, ba 1s gh hs (3.6) 
The advantage of the second method is that T — 0 when t — ov, and thus 
there is no need to specify the maximum number of iterations. For this 
reason, we will use this geometric cooling schedule. The cooling process 
should be slow enough to allow the system to stabilize easily. In practise, 
a = 0.7 ~ 0.99 is commonly used. 

In addition, for a given temperature, multiple evaluations of the objec- 
tive function are needed. If too few evaluations, there is a danger that the 
system will not stabilize and subsequently will not converge to its global 
optimality. If too many evaluations, it is time-consuming, and the system 
will usually converge too slowly, as the number of iterations to achieve 
stability might be exponential to the problem size. 

Therefore, there is a fine balance between the number of evaluations and 
solution quality. We can either do many evaluations at a few temperature 
levels or do few evaluations at many temperature levels. There are two 
major ways to set the number of iterations: fixed or varied. The first uses 
a fixed number of iterations at each temperature, while the second intends 
to increase the number of iterations at lower temperatures so that the local 
minima can be fully explored. 


3.3. SA ALGORITHM 


The simulated annealing algorithm can be summarized as the pseudo code 
shown in Fig. 3.1. 

In order to find a suitable starting temperature 7p, we can use any 
information about the objective function. If we know the maximum change 
max(Af) of the objective function, we can use this to estimate an initial 
temperature Jo for a given probability po. That is 


jMetaher,, 
ag - i 
“ poeee max(Af) 
e Inpo — 


x do not know the possible maximum change of the objective function, 
*can use a heuristic approach. We can start evaluations from a very 
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Simulated Annealing Algorithm 


Objective function f(x), w= (%41,...,2p)" 
Initialize initial temperature Ty and initial guess 2) 
Set final temperature Ts and maz number of iterations N 
Define cooling schedule T+ aT, (0<a< 1) 
while (T >T; andn<N ) 
Move randomly to new locations: Lj41 = Ly +€ (random walk) 
Calculate Af = fn4i(@n+1) — fn(an) 
Accept the new solution if better 
if not improved 
Generate a random number r 
Accept if p = exp|—Af/T] >r 
end if 
Update the best x, and f, 
n=n+l1 
end while 








Figure 3.1: Simulated annealing algorithm. 


high temperature (so that almost all changes are accepted) and reduce 
the temperature quickly until about 50% or 60% of the worse moves are 
accepted, and then use this temperature as the new initial temperature To 
for proper and relatively slow cooling. 

For the final temperature, it should be zero in theory so that no worse 
move can be accepted. However, if 7’ — 0, more unnecessary evaluations 
are needed. In practice, we simply choose a very small value, say, Ty = 
10-19 ~ 10-5, depending on the required quality of the solutions and time 
constraints. 


3.4 UNCONSTRAINED OPTIMIZATION 


Based on the guidelines of choosing the important parameters such as the 
cooling rate, initial and final temperatures, and the balanced number of 
iterations, we can implement the simulated annealing using both Matlab 
and Octave. 

For Rosenbrock’s banana function 


f(x,y) = (1-2)? + 100(y — 2°), 


cs 

Ss 
3 
a 
€ 
& 


know that its global 1 minimum i = 0 occurs at (1,1) (see i 3.2). This 
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Figure 3.2: Rosenbrock’s function with the global minimum f. = 0 at (1,1). 

















85 0 5 
Figure 3.3: 500 evaluations during the annealing iterations. The final global best 
is marked with e. 


global minimum easily and the last 500 evaluations during annealing are 







Gially those that are strongly multimodal and highly nonlinear. It is 
ightforward to extend the above program to deal with highly nonlinear 
Wltimodal functions. 
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Figure 3.4: The basic idea of stochastic tunneling by transform- 
ing f(x) to g(x), suppressing some modes and preserving the 
locations of minima. 


3.5 STOCHASTIC TUNNELING 


To ensure the global convergence of simulated annealing, a proper cooling 
schedule must be used. In the case when the functional landscape is com- 
plex, simulated annealing may become increasingly difficult to escape the 
local minima if the temperature is too low. Raising the temperature, as 
that in the so-called simulated tempering, may solve the problem, but the 
convergence is typically slow, and the computing time also increases. 

Stochastic tunneling uses the tunneling idea to transform the objective 
function landscape into a different but more convenient one (e.g., Wenzel 
and Hamacher, 1999). The essence is to construct a nonlinear transfor- 
mation so that some modes of f(a) are suppressed and other modes are 
amplified, while preserving the loci of minima of f(x). 

The standard form of such a tunneling transformation is 


g(x) = 1— exp[—9(f(x) — fo)], (3.7) 


where fo is the current lowest value of f(a) found so far. y > 0 is a scaling 
parameter, and g is the transformed new landscape. From this simple 
transformation, we can see that g — 0 when f — fp — 0, that is when fo is 
approaching the true global minimum. On the other hand, if f > fo, then 
g — 1, which means that all the modes well above the current minimum 
fo are suppressed. For a simple one-dimensional function, it is easy to see 
that such properties indeed preserve the loci of the function (see Fig. 3.4). 

As the loci of the minima are preserved, then all the modes that above 
the current lowest value fo are suppressed to some degree, while the modes 
below fp are expanded or amplified, which makes it easy for the system to 
“escape local modes. Simulations and studies suggest that it can significantly 

faprove the convergence for functions with complex landscape and modes. 
<Up to now we have not actually provided a detailed program to show 
ow the SA algorithm can be implemented in practice. However, before 
owe can actually do this, we need to find a practical way to deal with con- 
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straints, as most real-world optimization problems are constrained. In the 
next chapter, we will discuss in detail the ways of incorporating nonlinear 
constraints. 
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Chapter 4 


HOW TO DEAL WITH CONSTRAINTS 





The optimization we have discussed so far is unconstrained, as we have 
not considered any constraints. A natural and important question is how 
to incorporate the constraints (both inequality and equality constraints). 
There are mainly three ways to deal with constraints: direct approach, 
Lagrange multipliers, and penalty method. 

Direct approach intends to find the feasible regions enclosed by the con- 
straints. This is often difficult, except for a few special cases. Numeri- 
cally, we can generate a potential solution, and check if all the constraints 
are satisfied. If all the constraints are met, then it is a feasible solution, 
and the evaluation of the objective function can be carried out. If one 
or more constraints are not satisfied, this potential solution is discarded, 
and a new solution should be generated. We then proceed in a similar 
manner. As we can expect, this process is slow and inefficient. A better 
approach is to incorporate the constraints so as to formulate the problem 
as an unconstrained one. The method of Lagrange multiplier has rigorous 
mathematical basis, while the penalty method is simple to implement in 
practice. 


4.1 METHOD OF LAGRANGE MULTIPLIERS 


The method of Lagrange multipliers converts a constrained problem to an 
unconstrained one. For example, if we want to minimize a function 


meen” J(@),  & = (41, ...,2n)” ER", (4.1) 
subject to multiple nonlinear equality constraints 
gj(x) =0, FS 152,024 M- (4.2) 


iM'“AMo can use M Lagrange multipliers \;(7 = 1,...,.M) to reformulate the 
% j 
problem as the minimization of the following function 


at 


x 


Ss 





M 
L(a, Aj) = f(@) + >> Aj95(a). (4.3) 
j=l 
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The optimality requires that the following stationary conditions hold 











Ob OF 995 
Be; ~ Bag + De G5, (i =1,...,n), (4.4) 
and 3 
L : 
Dy > 9 =0, (j =1,...,M). (4.5) 


These M + n equations will determine the n components of « and M 
Lagrange multipliers. As oe = ;, we can consider A; as the rate of the 
z 


change of the quantity L(a, A;) as a functional of gj. 
Now let us look at a simple example 


maximize ¢ _ ,2/3,.1/3 
U,U =u Vv ’ 


subject to 
3sutv=9. 


First, we write it as an unconstrained problem using a Lagrange multiplier 
A, and we have 
L=u'/3y'/8 + \3u+v—9). 


The conditions to reach optimality are 


OL 2 -1/3,.1/3 OL 1 a7 _ 
ci 3A =0 — = <y?2/8y—2/8 = 
Au a ye? ; ao zu v +A=0, 
and ; 
1D 


The first two conditions give 20 = 3u, whose combination with the third 
condition leads to 
u= 2, v= 3. 


Thus, the maximum of f, is 12. 

Here we only discussed the equality constraints. For inequality con- 
straints, things become more complicated. We need the so-called Karush- 
Kuhn-Tucker conditions. 

Let us consider the following, generic, nonlinear optimization problem 


mpc f(a), 


subject to ¢;(~) = 0, (= 1,...,M), 


vj(z) <0, (j =1,...,N). (4.6) 
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If all the functions are continuously differentiable, at a local minimum z,, 
there exist constants Aj,...,A,¢ and [Uo, {1,---, 4 Such that the following 
KKT optimality conditions hold 


poV f (as ae Voi(as ies ue (w.) =0, (4.7) 
w=1 j=l 
and 
W;j(@%) 50, wy Yj (@*) = 0, (F = 1,2,...,.N), (4.8) 
where 


The last non-negativity conditions hold for all j;, though there is no con- 
straint on the sign of ;. 
The constants satisfy the following condition 


N M 
Soe + > lal > 0. (4.10) 
j=0 i=1 


This is essentially a generalized method of Lagrange multipliers. However, 
there is a possibility of degeneracy when fg = 0 under certain conditions. 
There are two possibilities: 1) there exist vectors A* = (Af,...,A%4,)7 and 
p* = (ut,..,u)? such that above equations hold, or 2) all the vectors 
Voi(ax), Voo(ax), 5 VW (ax), - 4 Vib (ata) are linearly independent, and 
in this case, the stationary conditions 2 be do not necessarily hold. As the 
second case is a special case, we will not discuss this further. 

The condition p;~;(x.) = 0 in (4.8) is often called the complementarity 
condition or complementary slackness condition. It either means 4; = 0 or 
w;(#.) = 0. The later case ~;(a.) = 0 for any particular 7 means the in- 
equality becomes tight, and thus becoming an equality. For the former case 
jij = 0, the inequality for a particular 7 holds and is not tight; however, 
tj = 0 means that this corresponding inequality can be ignored. There- 
fore, those inequalities that are not tight are ignored, while inequalities 
which are tight become equalities; consequently, the constrained problem 
with equality and inequality constraints now essentially becomes a mod- 
ified constrained problem with selected equality constraints. This is the 
beauty of the KKT conditions. The main issue remains to identify which 


x ,inequality becomes tight, and this depends on the individual optimization 
mS etahey,, 
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4.2 PENALTY METHOD 


For a nonlinear optimization problem with equality and inequality con- 
straints, a common method of incorporating constraints is the penalty 
method. For the optimization problem 


men” f(@), @ = (#1,..,2n)" ER", 


subject to ¢;(a) = 0, (¢=1,...,M), 


the idea is to define a penalty function so that the constrained problem is 
transformed into an unconstrained problem. Now we define 


N 
II (a, pi, 7) )+ » wih; (x) + S- YyWs (x), (4.12) 
j=l 


where pj > 1 and v; > 0 which should be large enough, depending on the 
solution quality needed. 

As we can see, when an equality constraint it met, its effect or contri- 
bution to II is zero. However, when it is violated, it is penalized heavily 
as it increases II significantly. Similarly, it is true when inequality con- 
straints become tight or exact. For the ease of numerical implementation, 
we should use index functions H to rewrite above penalty function as 


M 
x) + » MiHi{di(@ )+ +> vj HH. 03 (a), (4.13) 


Here H;[¢;(a)] and H;|[wW,;(a)] are index functions. 

More specifically, H;[¢;(a)] = 1 if ¢;(a@) # 0, and H; = 0 if d;(x) = 0. 
Similarly, H;[%;(x)] = 0 if w;(x) < 0 is true, while H; = 1 if ~,(x) > 0. 
In principle, the numerical accuracy depends on the values of ju; and v; 
which should be reasonably large. But how large is large enough? As most 
computers have a machine precision of € = 27°? = 2.2 x 107!®, yu; and v; 
should be close to the order of 10!°. Obviously, it could cause numerical 
problems if they are too large. 

In addition, for simplicity of implementation, we can use w = pu; for all 
i and vy = v; for all 7. That is, we can use a simplified 


M 
T(x, u,v) = f(x) +p S~ Hildi(w)] ase [w; (a) ]}W? (a). 


i=l j=l 
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Sometimes, it might be easier to change an equality constraint to two 
inequality constraints, so that we only have to deal with inequalities in the 
implementation. This is because g(a) = 0 is always equivalent to g(a) < 0 
and g(a) > 0 (or —g(a) < 0). 


4.3, STEP SIZE IN RANDOM WALKS 


As random walks are widely used for randomization and local search, a 
proper step size is very important. In the generic equation 


t+1 


git) =o +s e, (4.14) 


e, is drawn from a standard normal distribution with zero mean and unity 
standard deviation. Here the step size s determines how far a random 
walker (e.g., an agent or particle in metaheuristics) can go for a fixed 
number of iterations. 

If s is too large, then the new solution x‘! generated will be too far 
away from the old solution (or more often the current best). Then, such a 
move is unlikely to be accepted. If s is too small, the change is too small 
to be significant, and consequently such search is not efficient. So a proper 
step size is important to maintain the search as efficient as possible. 

From the theory of simple isotropic random walks, we know that the 
average distance r traveled in the d-dimension space is 


r? = 2dDt, (4.15) 


where D = s?/2r is the effective diffusion coefficient. Here s is the step 
size or distance traveled at each jump, and 7 is the time taken for each 
jump. The above equation implies that 


2 
2 Tr 


= —. 4.16 
fa (4.16) 
For a typical length scale L of a dimension of interest, the local search is 
typically limited in a region of L/10. That is, r = L/10. As the iterations 
are discrete, we can take 7 = 1. Typically in metaheuristics, we can expect 
that the number of generations is usually t = 100 to 1000, which means 
that 


~ 1 _ £/10 
ae ae (4.17) 


or. d = 1 and t = 100, we have s = 0.01L, while s = 0.001LZ for d = 10 
[+ = 1000. As step sizes could differ from variable to variable, a step 
Fatio s/L is more generic. Therefore, we can use s/L = 0.001 to 0.01 
gost problems. We will use this step size factor in our implementation, 


ta@be discussed later in the last section of this chapter. 
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In order to demonstrate the way we incorporate the constraints and the 
way to do the random walk, it is easy to illustrate using a real-world design 
example in engineering applications. Now let us look at the well-known 
welded beam design. 


4.4 WELDED BEAM DESIGN 


The welded beam design problem is a standard test problem for constrained 
design optimization, which was described in detail in the literature (Rags- 
dell and Phillips 1976, Cagnina et al 2008). The problem has four design 
variables: the width w and length L of the welded area, the depth d and 
thickness h of the beam. The objective is to minimize the overall fabri- 
cation cost, under the appropriate constraints of shear stress 7, bending 
stress 0, buckling load P and end deflection 6. 
The problem can be written as 


minimize f(a) = 1.10471w?L + 0.04811dh(14.0 + L), (4.18) 
subject to 
gi(a) = T(#) — 13,600 < 0 
g2(a@) = o(x) — 30,000 < 0 
g3(@) =w-h<o0 
ga(a) = 0.10471w? + 0.04811hd(14 + L) — 5.0 < 0 (4.19) 
g5(@) = 0.125 —w <0 
ge(x) = 6(x) — 0.25 < 0 
g7(a) = 6000 — P(a) < 0, 
where 
504, 000 65, 856 £ 
=. See = ae = 442 
ae ha” 30,000nds’  % = 8000(14 + 5), 


=i VEE wre, Javiwy + etO), pg. 2? 








2 J 
6000 aBL 
’ x) = 7 7 eee 2. 
aul Ta) = yar? +—— +B 
3 
/ 4 
P = 0.61423 x 10° a aie 2 ), (4.20) 


gh pettheans The simple limits or bounds are 0.1 < L,d < 10 and 0.1 < w,h < 2.0. 
«, If we use the simulated annealing algorithm to solve this problem (see 


ext section), we can get the optimal solution which is about the same 
tion obtained by Cagnina et al (2008) 






serpin? fie = 1.724852 at (0.205730, 3.470489, 9.036624, 0.205729). (4.21) 
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It is worth pointing out that you have to run the programs a few times 
using values such as a = 0.95 (default) and a = 0.99 to see how the results 
vary. In addition, as SA is a stochastic optimization algorithm, we cannot 
expect the results are the same. In fact, they will be slightly different, every 
time we run the program. Therefore, we should understand and interpret 
the results using statistical measures such as mean and standard deviation. 


4.5 SA IMPLEMENTATION 


We just formulated the welded beam design problem using different nota- 
tions from some literature. Here we try to illustrate a point. 

As the input to a function is a vector (either column vector or less often 
row vector), we have to write 


z=(w L d@ h)=([e(1) 2) 26) 24). (4.22) 
With this vector, the objective becomes 
minimize f(a) = 1.10471 * 2(1)? * x(2) + 0.04811 « #(3) « 2(4)(14.0 + 2(2)), 


which can easily be converted to a formula in Matlab. Similarly, the third 
inequality constraint can be rewritten as 


93 = 9(3) = (1) — (4) <0. (4.23) 


Other constraints can be rewritten in a similar way. 

Using the pseudo code for simulated annealing and combining with the 
penalty method, we can solve the above welded beam design problem using 
simulated annealing in Matlab as follows: 


% Simulated Annealing for constrained optimization 
% by Xin-She Yang @ Cambridge University ©2008 
% Usage: sa_mincon(0.99) or sa_mincon; 


function [bestsol,fval,N]=sa_mincon (alpha) 
% Default cooling factor 
if nargin<i, 

alpha=0.95; 


end 
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Ub=[2.0 10.0 10.0 2.0]; 
u0=(Lb+Ub) /2; 


if length(Lb) ~=length(Ub), 
disp(’Simple bounds/limits are improper!’) ; 
return 

end 


hh Start of the main program ------------------------- 
d=length(Lb) ; % Dimension of the problem 


% Initializing parameters and settings 


T_init = 1.0; % Initial temperature 

T_min = te-10; % Finial stopping temperature 
F_min = -1e+100; %, Min value of the function 
max_rej=500; % Maximum number of rejections 
max_run=150; % Maximum number of runs 
max_accept = 50; % Maximum number of accept 
initial_search=500; % Initial search period 

k = 1; % Boltzmann constant 
Enorm=1e-5; % Energy norm (eg, Enorm=1e-8) 


% Initializing the counters i,j etc 

i= 0; j = 0; accept = 0; totaleval = 0; 

% Initializing various values 

T = T_init; 

E_init = Fun(u0); 

E_old = E_init; E_new=E_old; 

best=u0; % initially guessed values 

% Starting the simulated annealing 

while ((T > T_min) & (j <= max_rej) & E_new>F_min) 


i = itt; 
% Check if max numbers of run/accept are met 
if (i >= max_run) | (accept >= max_accept) 


% reset the counters 
i= 1; accept = 1; 
% Cooling according to a cooling schedule 
T = cooling(alpha,T) ; 
end 


% Function evaluations at new locations 

if totaleval<initial_search, 
init_flag=1; 
ns=newsolution(u0,Lb,Ub, init_flag) ; 
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else 
init_flag=0; 
ns=newsolution(best ,Lb,Ub, init_flag) ; 

end 

totaleval=totaleval+1; 
E_new = Fun(ns); 

% Decide to accept the new solution 

DeltaE=E_new-E_old; 

% Accept if improved 

if (DeltaE <0) 
best = ns; E_old = E_new; 
accept=acceptti ; j = 0; 

end 

% Accept with a probability if not improved 

if (DeltaE>=0 & exp(-DeltaE/(k*T))>rand ); 
best = ns; E_old = E_new; 
accept=acceptt1; 

else 
j=jt1; 

end 

% Update the estimated optimal solution 

f_opt=E_old; 

end 


bestsol=best; 
fval=f_opt; 
N=totaleval; 


4, New solutions 

function s=newsolution(u0,Lb,Ub, init_flag) 
% Either search around 

if length(Lb)>0 & init_flag==1, 
s=Lb+(Ub-Lb) .*rand(size(u0)) ; 

else 
% Or local search by random walk 
stepsize=0.01; 
s=u0+stepsize*(Ub-Lb) .*randn(size(u0)) ; 

end 


(ales§=bounds (s ,Lb,Ub) ; 
cae on 







ion T=cooling(alpha,T) 
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function ns=bounds(ns,Lb, Ub) 

if length(Lb)>0, 

% Apply the lower bound 
ns_tmp=ns ; 
I=ns_tmp<Lb; 
ns_tmp(I)=Lb(1) ; 
% Apply the upper bounds 
J=ns_tmp>Ub; 
ns_tmp(J)=Ub(J) ; 

% Update this new move 
ns=ns_tmp; 

else 
ns=ns ; 

end 


% d-dimensional objective function 
function z=Fun(u) 


% Objective 
z=fobj(u); 


% Apply nonlinear constraints by penalty method 
% Z=f+sum_k=1°N lam_k g_k*2 *H(g_k) 
z=z+getnonlinear (u) ; 


function Z=getnonlinear (u) 
Z=0; 

% Penalty constant 
lam=10715; lameq=10715; 
[g,geq]=constraints(u) ; 


% Inequality constraints 
for k=1:length(g), 

Z=Z+ lam*g(k)*2*getH(g(k)) ; 
end 


% Equality constraints (when geq=[], length->0) 
for k=1:length(geq) , 
Z=Z+lameq*geq(k) ~2*geteqH (geq(k)) ; 






Test if inequalities hold 
Sunction H=getH(g) 
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end 


% Test if equalities hold 
function H=geteqH(g) 
if g==0, 
H=0; 
else 


end 


% Objective functions 

function z=fobj(u) 

% Welded beam design optimization 

z=1.10471*u(1) “2*u(2)+0.048114#u(3) *u(4) *(14.0+u(2)) ; 


% All constraints 

function [g,geq]=constraints (x) 

% Inequality constraints 

Q=6000* (14+x (2) /2) ; 

D=sqrt (x (2) *2/4+(x(1)+x(3))72/4) ; 

J=2 (x (1) *x (2) *sqrt (2) * (x (2) 72/12+ (x(1)+x(3))72/4)) ; 
alpha=6000/ (sqrt (2) *x(1)*x(2)); 

beta=Q+D/J; 

tau=sqrt (alpha*2+2*alpha*beta*x(2)/(2*D)+beta~2) ; 
sigma=504000/ (x (4) *x(3)*2) ; 

delta=65856000/ (30*10°6*x (4) *x (3) 73) ; 

tmpf=4 .013* (30*1076) /196; 

P=tmpf*sqrt (x (3) ~2*x (4) *6/36) *(1-x (3) *sqrt (30/48) /28) ; 


g(1)=tau-13600; 
g(2)=sigma-30000; 
g(3)=x(1)-x(A) ; 
g(4)=0.10471*x (1) *2+0.04811*x (3) x (4) *(14+x(2))-5.0; 
g(5)=0.125-x(1); 
g(6)=delta-0.25; 

neta (7) =6000-P ; 





40 CHAPTER 4. HOW TO DEAL WITH CONSTRAINTS 


How to Get the Files 


To get the files of all the Matlab programs provided in this book, readers 
can send an email (with the subject ‘Nature-Inspired Algorithms: Files’) 
to Metaheuristic.Algorithms@gmail.com — A zip file will be provided 
(via email) by the author. 
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Chapter 5 


GENETIC ALGORITHMS 





Genetic algorithms are probably the most popular evolutionary algorithms 
in terms of the diversity of their applications. A vast majority of well-known 
optimization problems have been tried by genetic algorithms. In addition, 
genetic algorithms are population-based, and many modern evolutionary 
algorithms are directly based on genetic algorithms, or have some strong 
similarities. 


5.1 INTRODUCTION 


The genetic algorithm (GA), developed by John Holland and his collabo- 
rators in the 1960s and 1970s, is a model or abstraction of biological evo- 
lution based on Charles Darwin’s theory of natural selection. Holland was 
the first to use the crossover and recombination, mutation, and selection 
in the study of adaptive and artificial systems. These genetic operators 
form the essential part of the genetic algorithm as a problem-solving strat- 
egy. Since then, many variants of genetic algorithms have been developed 
and applied to a wide range of optimization problems, from graph colour- 
ing to pattern recognition, from discrete systems (such as the travelling 
salesman problem) to continuous systems (e.g., the efficient design of air- 
foil in aerospace engineering), and from financial market to multiobjective 
engineering optimization. 

There are many advantages of genetic algorithms over traditional opti- 
mization algorithms, and two most noticeable advantages are: the ability 
of dealing with complex problems and parallelism. Genetic algorithms can 
deal with various types of optimization whether the objective (fitness) func- 
tion is stationary or non-stationary (change with time), linear or nonlinear, 
continuous or discontinuous, or with random noise. As multiple offsprings 

tee a population act like independent agents, the population (or any sub- 
ig) can explore the search space in many directions simultaneously. 
feature makes it ideal to parallelize the algorithms for implementa- 
. Different parameters and even different groups of encoded strings can 
e ‘manipulated at the same time. 
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Genetic Algorithm 





Objective function f(x), x = (21,...,2n)" 
Encode the solution into chromosomes (binary strings) 
Define fitness F (eg, F x f(x) for maximization) 
Generate the initial population 
Initial probabilities of crossover (p.) and mutation (Dm) 
while (¢ <Maz number of generations ) 
Generate new solution by crossover and mutation 
if p. >rand, Crossover; end if 
if p», >rand, Mutate; end if 
Accept the new solution if its fitness increases 
Select the current best for the next generation (elitism) 
end while 
Decode the results and visualization 





Figure 5.1: Pseudo code of genetic algorithms. 


However, genetic algorithms also have some minor disadvantages. The 
formulation of fitness function, the usage of population size, the choice of 
the important parameters such as the rate of mutation and crossover, and 
the selection criteria of new population should carefully be carried out. Any 
inappropriate choice will make it difficult for the algorithm to converge, or 
it simply produces meaningless results. Despite these, genetic algorithms 
remain one of the most widely used optimization algorithms in modern 
nonlinear optimization. 


5.2. GENETIC ALGORITHMS 


The essence of genetic algorithms involves the encoding of an optimization 
function as arrays of bits or character strings to represent the chromosomes, 
the manipulation operations of strings by genetic operators, and the selec- 
tion according to their fitness with the aim to find a solution to the problem 
concerned. 

This is often done by the following procedure: 1) encoding of the objec- 
tives or optimization functions; 2) defining a fitness function or selection 
criterion; 3) creating a population of individuals; 4) evolution cycle or it- 
erations by evaluating the fitness of all the individuals in the population, 


“gereating a new population by performing crossover, and mutation, fitness- 
> 


foportionate reproduction etc, and replacing the old population and iter- 
ang again using the new population; 5) decoding the results to obtain the 
Solution to the problem. These steps can schematically be represented as 
othe pseudo code of genetic algorithms shown in Fig. 5.1. 
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Child gene pair (after crossover) 


Figure 5.2: Diagram of crossover at a random crossover point 
(location) in genetic algorithms. 


One iteration of creating a new population is called a generation. The 
fixed-length character strings are used in most of genetic algorithms dur- 
ing each generation although there is substantial research on the variable- 
length strings and coding structures. The coding of the objective function 
is usually in the form of binary arrays or real-valued arrays in the adap- 
tive genetic algorithms. For simplicity, we use binary strings for encoding 
and decoding in our discussion. The genetic operators include crossover, 
mutation, and selection from the population. 

The crossover of two parent strings is the main operator with a higher 
probability p. and is carried out by swapping one segment of one chromo- 
some with the corresponding segment on another chromosome at a random 
position (see Fig. 5.2). The crossover carried out in this way is a single- 
point crossover. Crossover at multiple points is more often used in genetic 
algorithms to increase the evolutionary efficiency of the algorithms. 

The mutation operation is achieved by flopping the randomly selected 
bits (see Fig. 5.3), and the mutation probability p,, is usually small. 

The selection of an individual in a population is carried out by the 
evaluation of its fitness, and it can remain in the new generation if a certain 
threshold of the fitness is reached, we can also use that the reproduction of 
a population is fitness-proportionate. That is to say, the individuals with 
higher fitness are more likely to reproduce. 


5.3 CHOICE OF PARAMETERS 


‘a. important issue is the formulation or choice of an appropriate fitness 
éion that determines the selection criterion in a particular problem. 
the minimization of a function using genetic algorithms, one simple 
of constructing a fitness function is to use the simplest form F = A—y 
th A being a large constant (though A = 0 will do if the fitness needs 
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pa tiyotititojols | 


Original gene | (before mutation) 
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New gene (after crossover) 
Figure 5.3: Schematic representation of mutation at a single 
site by flipping a randomly selected bit (1 — 0). 


not to be non negative) and y = f(x), thus the objective is to maximize 
the fitness function and subsequently minimize the objective function f(x). 
However, there are many different ways of defining a fitness function. For 
example, we can use the individual fitness assignment relative to the whole 
population 


f(€)) 
F dy) SS omy 5.1 
a ea FE) oe 


where &; is the phenotypic value of individual 7, and N is the population 
size. The appropriate form of the fitness function will make sure that the 
solutions with higher fitness should be selected efficiently. Poor fitness 
function may result in incorrect or meaningless solutions. 

Another important issue is the choice of various parameters. The crossover 
probability p. is usually very high, typically in the range of 0.7 ~ 1.0. 
On the other hand, the mutation probability p,, is usually small (usually 
0.001 ~ 0.05). If p. is too small, then the crossover occurs sparsely, which 
is not efficient for evolution. If the mutation probability is too high, the 
solutions could still ‘jump around’ even if the optimal solution is approach- 
ing. 

A proper criterion for selecting the best solutions is also important. How 
to select the current population so that the best individuals with higher 
fitness should be preserved and passed onto the next generation. That is 
often carried out in association with certain elitism. The basic elitism is 
to select the most fit individual (in each generation) which will be carried 
over to the new generation without being modified by genetic operators. 
This ensures that the best solution is achieved more quickly. 

Other issues include the multiple sites for mutation and the use of var- 
ious population sizes. The mutation at a single site is not very efficient, 

Mes svuatation at multiple sites will increase the evolution efficiency. However, 
‘60 many mutants will make it difficult for the system to converge or even 
akes the system go astray to the wrong solutions. In real ecological sys- 
ems, if the mutation rate is too high under high selection pressure, then 
othe whole population might go extinct. 








5.3 CHOICE OF PARAMETERS 45 


Figure 5.4: Encode all design variables into a single long string. 


In addition, the choice of the right population size is also very important. 
If the population size is too small, there is not enough evolution going on, 
and there is a risk for the whole population to go extinct. In the real world, 
a species with a small population, ecological theory suggests that there is 
a real danger of extinction for such species. Even the system carries on, 
there is still a danger of premature convergence. In a small population, if a 
significantly more fit individual appears too early, it may reproduces enough 
offsprings so that they overwhelm the whole (small) population. This will 
eventually drive the system to a local optimum (not the global optimum). 
On the other hand, if the population is too large, more evaluations of the 
objective function are needed, which will require extensive computing time. 

Furthermore, more complex and adaptive genetic algorithms are under 
active research and the literature is vast about these topics. 

Using the basic procedure described in the above section, we can im- 
plement the genetic algorithms in any programming language. From the 
implementation point of view, there are two main ways to use encoded 
strings. We can either use a string (or multiple strings) for each individual 
in a population, or use a long string for all the individuals (or all design 
variables) shown in Fig. 5.4. 

For the well-known Easom function 


f(x) =—cos(x)e""-™”",, x € [-10, 10], (5.2) 


it has the global maximum fmax = 1 at x, = 7. The outputs from a typical 
run are shown in Fig. 5.5 where the top figure shows the variations of the 
best estimates as they approach to xz, — 7, while the lower figure shows 
the variations of the fitness function. 

Genetic algorithms are widely used and there are many software pack- 
ages in almost all programming languages. Efficient implementation can 
be found in both commercial and free codes. For this reason, we will not 
provide any implementation here. 
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Best estimate: « — 7 








Fitness: f(x) > 1 


0 5 10 15 


Figure 5.5: Typical outputs from a typical run. The best esti- 
mate will approach 7 while the fitness will approach fmax = 1. 
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Chapter 6 


DIFFERENTIAL EVOLUTION 





6.1 INTRODUCTION 


Differential evolution (DE) was developed by R. Storn and K. Price by 
their nominal papers in 1996 and 1997. It is a vector-based evolutionary 
algorithm, and can be considered as a further development to genetic al- 
gorithms. It is a stochastic search algorithm with self-organizing tendency 
and does not use the information of derivatives. Thus, it is a population- 
based, derivative-free method. Another advantage of differential evolution 
over genetic algorithms is that DE treats solutions as real-number strings, 
thus no encoding and decoding is needed. 

As in genetic algorithms, design parameters in a d-dimensional search 
space are represented as vectors, and various genetic operators are operated 
over their bits of strings. However, unlikely genetic algorithms, differential 
evolution carries out operations over each component (or each dimension of 
the solution). Almost everything is done in terms of vectors. For example, 
in genetic algorithms, mutation is carried out at one site or multiple sites 
of a chromosome, while in differential evolution, a difference vector of two 
randomly-chosen population vectors is used to perturb an existing vector. 
Such vectorized mutation can be viewed as a self-organizing search, directed 
towards an optimality. This kind of perturbation is carried out over each 
population vector, and thus can be expected to be more efficient. Similarly, 
crossover is also a vector-based component-wise exchange of chromosomes 
or vector segments. 


6.2 DIFFERENTIAL EVOLUTION 


For a d-dimensional optimization problem with d parameters, a population 
qv solution vectors are initially generated, we have x; where i = 1, 2,...,n. 
‘each solution x; at any generation t, we use the conventional notation 


xt = (25 4 B54, seat ga ys (6.1) 
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Figure 6.1: Schematic representation of mutation vectors in 
differential evolution with movement 6 = F(aq —2,). 


which consists of d-components in the d-dimensional space. This vector 
can be considered as the chromosomes or genomes. 

Differential evolution consists of three main steps: mutation, crossover 
and selection. 

Mutation is carried out by the mutation scheme. For each vector «x; at 
any time or generation t, we first randomly choose three distinct vectors 
Lp, L, and w, at t (see Fig. 6.1), and then generate a so-called donor 
vector by the mutation scheme 


vitt = ae + F(a — 2°), (6.2) 


where F' € [0,2] is a parameter, often referred to as the differential weight. 
This requires that the minimum number of population size is n > 4. In 
principle, F € [0,2], but in practice, a scheme with F € [0,1] is more 
efficient and stable. From Fig. 6.1, we can see that the perturbation 
6 = F(a,—,) to the vector a, is used to generate a donor vector v;, and 
such perturbation is directed and self-organized. 

The crossover is controlled by a crossover probability C;, € [0,1] and 
actual crossover can be carried out in two ways: binomial and exponen- 
tial. The binomial scheme performs crossover on each of the d components 
or variables/parameters. By generating a uniformly distributed random 
number r; € [0,1], the jth component of v; is manipulated as 


v;;, if n<?, ; 
ii = {2 athens j =1,2,...,d. (6.3) 
“zThis way, each component can be decided randomly whether to exchange 
ith donor vector or not. 
SIn the exponential scheme, a segment of the donor vector is selected and 
fis segment starts with a random k with a random length L which can 
“include many components. Mathematically, this is to choose k € [(0,d — 1] 
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Differential Evolution 





Initialize the population x with randomly generated solutions 
Set the weight F' € [0,2] and crossover probability C, € [0,1] 
while (stopping criterion) 
for i=1 ton, 
For each x;, randomly choose 3 distinct vectors ©), x, and x, 
Generate a new vector v by DE scheme (6.2) 
Generate a random index J, € {1,2,...,d} by permutation 
Generate a randomly distributed number r; € [0, 1] 
for j = 1 tod, 
For each parameter v;,, (jth component of v;), update 


ttl _ wt if 73 < Cy, or j = J, 
b x if ri > C, andj # J, 


end 
Select and update the solution by (6.5) 
end 
Update the counters such ast =t+1 
end 
Post-process and output the best solution found 





Figure 6.2: Pseudo code of differential evolution. 


and L € [1,d] randomly, and we have 


ve, for j=k,...,k-L+1€ [1d] 
uitt = (6.4) 
Di otherwise. 


As the binomial is simpler to implement, we will use the binomial crossover 
in our implementation. 

Selection is essentially the same as that used in genetic algorithms. It 
is to select the most fittest, and for minimization problem, the minimum 
objective value. Therefore, we have 


ait! = cs if fui") < F(w?), - 


x; otherwise. 


%A]) the above three components can be seen in the pseudo code as shown 
. 6.2. It is worth pointing out here that the use of J is to ensure 
4 ie # x‘, which may increase the evolutionary or exploratory effi- 
y. The overall search efficiency is controlled by two parameters: the 
ifferential weight F’ and the crossover probability C,. 
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6.3. VARIANTS 


Most studies have focused on the choice of F', C; and n as well as the 
modification of (6.2). In fact, when generating mutation vectors, we can 
use many different ways of formulating (6.2), and this leads to various 
schemes with the naming convention: DE/x/y/z where x is the mutation 
scheme (rand or best), y is the number of difference vectors, and z is the 
crossover scheme (binomial or exponential). The basic DE/Rand/1/Bin 
scheme is given in (6.2), that is 


vith wf, + F(a — 2%). (6.6) 


If we replace the ae by the current best Zpes; found so far, we have the 
so-called DE/Best/1/Bin scheme 


vit? = thes, + Flas — wi). (6.7) 


a 
There is no reason that why we should not use more than 3 distinct vectors. 
For example. if we use 4 different vectors plus the current best, we have 
the DE/Best /2/Bin scheme 


vt = oP ae a F(x}, + Li, > ,, = i.) (6.8) 


Furthermore, if we use five different vectors, we have the DE/Rand/2/Bin 
scheme 
vyt) = ah, + Fila, — 2,) + Fo(a, — 2h) (6.9) 


where F\ and F are differential weights in [0,1]. Obviously, for simplicity, 
we can also take fF, = Fo = F. 

Following the similar strategy, we can design various schemes. In fact, 
10 different schemes have been formulated, and for details, readers can refer 
to Price, Storn and Lampinin (2005). 


6.4 IMPLEMENTATION 


The implementation of differential evolution is relatively straightforward, 
comparing with genetic algorithms. If a vector /matrix-based software pack- 
age such as Matlab is used, the implementation becomes even more simple. 
A simple version of the DE in Matlab/Octave is given below, which is for 
unconstrained optimization where Rosenbrock’s function with three vari- 
ables is solved by default. To solve constrained optimization problems, this 
“program can easily be extended in combination with the penalty method 
scussed earlier when we solved the welded beam design problem. 


yt 


x 
& Differential Evolution for global optimization 
o% Programmed by Xin-She Yang @Cambridge University 2008 


6.4 IMPLEMENTATION 


% The basic version of scheme DE/Rand/1 is implemented 
% Usage: de(para) or de; 


function [best,fmin,N_iter]=de(para) 
% Default parameters 
if nargin<i, 

para=[10 0.7 0.9]; 


end 

n=para(1); % Population >=4, typically 10 to 25 
F=para(2) ; % DE parameter - scaling (0.5 to 0.9) 
Cr=para(3) ; % DE parameter - crossover probability 


% Iteration parameters 
tol=107(-5); % Stop tolerance 
N_iter=0; % Total number of function evaluations 


% Simple bounds 
Lb=[-1 -1 -1]; 
Ub=[2 2 2]; 


%, Dimension of the search variables 
d=length(Lb) ; 


% Initialize the population/solutions 
for i=1:n, 
Sol(i,:)=Lb+(Ub-Lb) .*rand(size(Lb)) ; 
Fitness (i)=Fun(Sol(i,:)); 
end 


% Find the current best 
[fmin,1I]=min(Fitness) ; 
best=Sol(I,:) 


% Start the iterations by differential evolution 

while (fmin>tol) 
% Obtain donor vectors by permutation 
ki=randperm(n) ; k2=randperm(n) ; 
k1sol=So0l1(k1,:); k2sol1=So01(k2,:); 

<< % Random crossover index/matrix 

K=rand(n,d)<Cr; 

% DE/RAND/1 scheme 

V=Sol1+F* (kisol-k2so1) ; 
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V=Sol.*(1-K)+V.*K; 


% Evaluate new solutions 
for i=1:n, 
Fnew=Fun(V(i,:)); 
% If the solution improves 
if Fnew<=Fitness(i), 
Sol(i,:)=V(i,:); 
Fitness (i)=Fnew; 
end 
% Update the current best 
if Fnew<=fmin, 
best=V(i,:);3 
fmin=Fnew; 
end 
end 
N_iter=N_iter+n; 
end 


% Output/display 
disp([’Number of evaluations: ’,num2str(N_iter)]); 
disp([’Best=’ ,num2str(best) ,’ fmin=’ ,num2str(fmin)]); 


% Objective function -- Rosenbrock’s 3D function 
function z=Fun(u) 
z=(1-u(1) )72+100* (u(2) -u(1) 72) 2+100* (u(3) -u (2) *2) 72; 
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Chapter 7 


ANT AND BEE ALGORITHMS 





From the discussion of genetic algorithms, we know that we can improve the 
search efficiency by using randomness which will also increase the diversity 
of the solutions so as to avoid being trapped in local optima. The selection 
of the best individuals is also equivalent to use memory. In fact, there 
are other forms of selection such as using chemical messenger (pheromone) 
which is commonly used by ants, honey bees, and many other insects. In 
this chapter, we will discuss the nature-inspired ant algorithms and bee 
algorithms. 


7.1. ANT ALGORITHMS 


7.1.1. Behaviour of Ants 


Ants are social insects in habit and they live together in organized colonies 

whose population size can range from about 2 to 25 millions. When for- 

aging, a swarm of ants or mobile agents interact or communicate in their 

local environment. Each ant can lay scent chemicals or pheromone so as 

to communicate with others, and each ant is also able to follow the route 

marked with pheromone laid by other ants. When ants find a food source, 

they will mark it with pheromone and also mark the trails to and from 

it. From the initial random foraging route, the pheromone concentration 

varies and the ants follow the route with higher pheromone concentration, 

and the pheromone is enhanced by the increasing number of ants. As more 

and more ants follow the same route, it becomes the favored path. Thus, 

some favorite routes emerge, often the shortest or more efficient. This is 
actually a positive feedback mechanism. 

Emerging behavior exists in an ant colony and such emergence arises 

from simple interactions among individual ants. Individual ants act ac- 

Rosen ding to simple and local information (such as pheromone concentration) 

Garry out their activities. Although there is no master ant overseeing 

entire colony and broadcasting instructions to the individual ants, or- 

ized behavior still emerges automatically. Therefore, such emergent 

eavior is similar to other self- organized phenomena which occur in many 
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processes in nature such as the pattern formation in animal skins (tiger 
and zebra skins). 

The foraging pattern of some ant species (such as the army ants) can 
show extraordinary regularity. Army ants search for food along some reg- 
ular routes with an angle of about 123° apart. We do not know how they 
manage to follow such regularity, but studies show that they could move in 
an area and build a bivouac and start foraging. On the first day, they for- 
age in a random direction, say, the north and travel a few hundred meters, 
then branch to cover a large area. The next day, they will choose a differ- 
ent direction, which is about 123° from the direction on the previous day 
and cover a large area. On the following day, they again choose a different 
direction about 123° from the second day’s direction. In this way, they 
cover the whole area over about 2 weeks and they move out to a different 
location to build a bivouac and forage in the new region. 

The interesting thing is that they do not use the angle of 360°/3 = 120° 
(this would mean that on the fourth day, they will search on the empty 
area already foraged on the first day). The beauty of this 123° angle is that 
it leaves an angle of about 10° from the direction on the first day. This 
means they cover the whole circular region in 14 days without repeating 
(or covering a previously-foraged area). This is an amazing phenomenon. 


7.1.2 Ant Colony Optimization 


Based on these characteristics of ant behaviour, scientists have developed 
a number of powerful ant colony algorithms with important progress made 
in recent years. Marco Dorigo pioneered the research in this area in 1992. 
Many different variants appear since then. 

If we only use some of the nature or the behavior of ants and add some 
new characteristics, we can devise a class of new algorithms. The basic 
steps of the ant colony optimization (ACO) can be summarized as the 
pseudo code shown in Fig. 7.1. 

There are two important issues here: the probability of choosing a route, 
and the evaporation rate of pheromone. There are a few ways of solving 
these problems, although it is still an area of active research. Here we 
introduce the current best method. 

For a network routing problem, the probability of ants at a particular 
node 7 to choose the route from node 7 to node j is given by 

a 78 
po (7.1) 
oh etahiety, i) i,j=l ord! ij 






ere @ > 0 and @ > 0 are the influence parameters, and their typical 
ues are @ & pe 2. BA is the pheromone concentration on the route 
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Ant Colony Optimization 





Objective function f(x), 2 = (21,...,2n)7 
[or f(xi;) for a routing problem where (i,j) € {1,2,...,n}/ 
Define pheromone evaporation rate 
while ( criterion ) 
for loop over all n dimensions (or nodes) 
Generate new solutions 
Evaluate the new solutions 
Mark better locations/routes with pheromone 564; 
Update pheromone: $4; — (1 — y)¢ijy + 66%; 
end for 
Find the current best 
end while 
Output the best results and pheromone distribution 





Figure 7.1: Pseudo code of ant colony optimization. 


d;; «x 1/s;;, which implies that shorter routes will be selected due to their 
shorter traveling time, and thus the pheromone concentrations on these 
routes are higher. This is because the traveling time is shorter, and thus 
the less amount of the pheromone has been evaporated during this period. 

This probability formula reflects the fact that ants would normally follow 
the paths with higher pheromone concentrations. In the simpler case when 
a = 6 =1, the probability of choosing a path by ants is proportional to 
the pheromone concentration on the path. The denominator normalizes 
the probability so that it is in the range between 0 and 1. 

The pheromone concentration can change with time due to the evapora- 
tion of pheromone. Furthermore, the advantage of pheromone evaporation 
is that the system could avoid being trapped in local optima. If there is 
no evaporation, then the path randomly chosen by the first ants will be- 
come the preferred path as the attraction of other ants by their pheromone. 
For a constant rate y of pheromone decay or evaporation, the pheromone 
concentration usually varies with time exponentially 


(t) = doe, (7.2) 


where @o is the initial concentration of pheromone and t is time. If yt < 1, 





ie =(1- 1) bi + Obi;, (7.3) 
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Figure 7.2: The double bridge problem for routing performance: 
route (2) is shorter than route (1). 


where ¥ € [0,1] is the rate of pheromone evaporation. The increment 6¢j, 
is the amount of pheromone deposited at time ¢t along route i to 7 when 
an ant travels a distance L. Usually 66%; x 1/L. If there are no ants on a 
route, then the pheromone deposit is zero. 

There are other variations to this basic procedure. A possible acceler- 
ation scheme is to use some bounds of the pheromone concentration and 
only the ants with the current global best solution(s) are allowed to de- 
posit pheromone. In addition, certain ranking of solution fitness can also 
be used. These are still topics of active research. 


7.1.3. Double Bridge Problem 


A standard test problem for ant colony optimization is the simplest double 
bridge problem with two branches (see Fig. 7.2) where route (2) is shorter 
than route (1). The angles of these two routes are equal at both point A 
and point B so that the ants have equal chance (or 50-50 probability) of 
choosing each route randomly at the initial stage at point A. 

Initially, fifty percent of the ants would go along the longer route (1) 
and the pheromone evaporates at a constant rate, but the pheromone con- 
centration will become smaller as route (1) is longer and thus takes more 
time to travel through. Conversely, the pheromone concentration on the 
shorter route will increase steadily. After some iterations, almost all the 
ants will move along the shorter route. Here we only use two routes at the 
node, it is straightforward to extend it to the multiple routes at a node. It 
is expected that only the shortest route will be chosen ultimately. As any 

“g¢omplex network system is always made of individual nodes, this algorithms 
‘an be extended to solve complex routing problems reasonably efficiently. 
fact, the ant colony algorithms have been successfully applied to the 
iternet routing problem, the traveling salesman problem, combinatorial 
eSptimization problems, and other NP-hard problems. 
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7.1.4 Virtual Ant Algorithm 


As we know that ant colony optimization has successfully solved NP- 
hard problems such as the traveling salesman problem, it can also be ex- 
tended to solve the standard optimization problems of multimodal func- 
tions. The only problem now is to figure out how the ants will move on an 
n-dimensional hyper-surface. For simplicity, we will discuss the 2-D case 
which can easily be extended to higher dimensions. On a 2D landscape, 
ants can move in any direction or 0° ~ 360°, but this will cause some 
problems. How to update the pheromone at a particular point as there are 
infinite number of points. One solution is to track the history of each ant 
moves and record the locations consecutively, and the other approach is to 
use a moving neighborhood or window. The ants ‘smell’ the pheromone 
concentration of their neighborhood at any particular location. 

In addition, we can limit the number of directions the ants can move by 
quantizing the directions. For example, ants are only allowed to move left 
and right, and up and down (only 4 directions). We will use this quantized 
approach here, which will make the implementation much simpler. Fur- 
thermore, the objective function or landscape can be encoded into virtual 
food so that ants will move to the best locations where the best food sources 
are. This will make the search process even more simpler. This simplified 
algorithm is called Virtual Ant Algorithm (VAA) developed by Xin-She 
Yang and his colleagues in 2006, which has been successfully applied to 
topological optimization problems in engineering. 

It is worth pointing out that ant colony algorithms are a good tool 
for combinatorial and discrete optimization. They have the advantages 
over other stochastic algorithms such as genetic algorithms and simulated 
annealing in dealing with dynamical network routing problems. 


7.2. BEE-INSPIRED ALGORITHMS 


Bee algorithms form another class of algorithms which are closely related 
to the ant colony optimization. Bee algorithms are inspired by the forag- 
ing behaviour of honeybees. Several variants of bee algorithms have been 
formulated, including the honeybee algorithm (HBA), the virtual bee algo- 
rithm (VBA), the artificial bee colony (ABC) optimization, the honeybee- 
mating Algorithm (HBMA) and others. 






ed colony. Honeybees can communicate by pheromone and ‘waggle 
nce’. For example, an alarming bee may release a chemical message 
_4heromone) to stimulate attack response in other bees. Furthermore, 
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when bees find a good food source and bring some nectar back to the 
hive, they will communicate the location of the food source by perform- 
ing the so-called waggle dances as a signal system. Such signaling dances 
vary from species to species, however, they will try to recruit more bees by 
using directional dancing with varying strength so as to communicate the 
direction and distance of the found food resource. 

For multiple food sources such as flower patches, studies show that a 
bee colony seems to be able to allocate forager bees among different flower 
patches so as to maximize their total nectar intake. In order to survive 
the winter, a bee colony typically has to collect and store extra nectar, 
about 15 to 50 kg. The efficiency of nectar collection is consequently very 
important from the evolution point of view. Experimental studies have 
also been carried out by researchers, including the important work by S$ 
Camazine, T Seeley and J Sney in early 1990s and lately by Quijano and 
K. Passino and their colleagues. Various algorithms can be designed if we 
learn from the natural behaviour of bee colonies. 


7.2.2. Bee Algorithms 


Over the last decade or so, nature-inspired bee algorithms have started to 
emerge as a promising and powerful tool. It is difficult to pinpoint the exact 
dates when the bee algorithms were first formulated. They were developed 
over a few years independently by several groups of researchers. 

From the literature survey, it seems that the honey bee algorithm (HBA) 
was first formulated in around 2004 by Craig A Tovey at Georgia Tech 
in collaboration with Sunil Nakrani then at Oxford University to study 
a method to allocate computers among different clients and web-hosting 
servers. Later in 2004 and earlier 2005, Xin-She Yang at Cambridge Uni- 
versity developed a virtual bee algorithm (VBA) to solve numerical opti- 
mization problems, and it can optimize both functions and discrete prob- 
lems, though only functions with two parameters were given as examples. 
Slightly later in 2005, Haddad and Afshar and their colleagues presented a 
honeybee-mating optimization (HBMO) algorithm which was subsequently 
applied to reservoir modelling and clustering. Around the same time, D 
Karaboga in Turkey developed an artificial bee colony (ABC) algorithm 
for numerical function optimization and a comparison study was carried 
by in the same group later in 2006 and 2008. These bee algorithms have 
an increasing popularity. 

The essence of bee algorithms are the communication or broadcasting 
ility of a bee to some neighbourhood bees so they can ‘know’ and follow 
bee to the best source, locations or routes to complete the optimization 
isk. The detailed implementation will depend on the actual algorithms, 
And they may differ slightly and vary with different variants. However, the 
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essence of all the bee algorithms can be summarized as the pseudo code in 
Fig. 7.3. 


7.2.3, Honeybee Algorithm 


In the honeybee algorithm, forager bees are allocated to different food 
sources (or flower patches) so as to maximize the total nectar intake. The 
colony has to ‘optimize’ the overall efficiency of nectar collection, the al- 
location of the bees is thus depending on many factors such as the nectar 
richness and the proximity to the hive. This problem is similar to the allo- 
cation of web-hosting servers in the Internet which was in fact one of the 
first problems solved using bee-inspired algorithms by Nakrani and Tovey 
in 2004. 

Let w;(j) be the strength of the waggle dance of bee i at time step t = J, 
the probability of an observer bee following the dancing bee to forage can 
be determined in many ways depending on the actual variant of algorithms. 
A simple way is given by 

wy 

= =, 7.4 
where ny is the number of bees in foraging process. ¢ is the pseudo time 
or foraging expedition. The number of observer bees is N — ng when N 
is the total number of bees. Alternatively, we can define an exploration 
probability of a Gaussian type pe = 1—p; = exp|—w?/207], where o is the 
volatility of the bee colony, and it controls the exploration and diversity 
of the foraging sites. If there is no dancing (no food found), then w; — 0, 
and p. = 1. So all the bee explore randomly. 

In other variant algorithms when applying to discrete problems such as 
job scheduling, a forager bee will perform waggle dance with a duration 
T = 7fp where f, is the profitability or the richness of the food site, and 
y is a scaling factor. The profitability should be related to the objective 
function. 

In addition, the rating of each route is ranked dynamically and the path 
with the highest number of bees become the preferred path. For a routing 
problem, the probability of choosing a route between any two nodes can 
take the form similar to the equation (7.1). That is 


ag 4g 
Pig = <n. cepa 5 (7.5) 
at we 3 


ce a > 0 and @ > 0 are the influence parameters. w;; is the dance 


rth along route i to j, and dj; is the desirability of the same route. 
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Bee Algorithms 





Objective function f(a), @ = (21,...,2%n)' © constraints 
Encode f(x) into virtual nectar levels 
Define dance routine (strength, direction) or protocol 
while ( criterion ) 

for loop over all n dimensions 

(or nodes for routing and scheduling problems) 

Generate new solutions 

Evaluate the new solutions 

end for 

Communicate and update the optimal solution set 
end while 
Decode and output the best results 





Figure 7.3: Pseudo code of bee algorithms 


scheduling. When dealing with continuous optimization problems, it is not 
straightforward, and some modifications are needed. 


7.2.4 Virtual Bee Algorithm 


The virtual bee algorithm (VBA), developed by Xin-She Yang in 2005, 
is an optimization algorithm specially formulated for solving both discrete 
and continuous problems. It has some similarity to the particle swarm opti- 
mization (PSO) to be discussed later, rather than a bee algorithm. In VBA, 
the continuous objective function is directly encoded as virtual nectar, and 
the solutions (or decision variables) are the locations of the nectar. The 
activities such as the waggle dance strength (or similar to the pheromone 
in the ant algorithms) are combined with the nectar concentration as the 
‘fitness’ of the solutions. For a maximization problem, the objective func- 
tion can be thought as virtual nectar, while for minimization, the nectar is 
formulated in such a way that the minimal value of the objective function 
corresponds to the highest nectar concentration. 

For discrete problems, the objective function such as the shorter paths 
are encoded and linked with the profitability of the nectar explorations, 
which is in turn linked with the dance strength of forager bees. In this way, 
the virtual bee algorithm is similar to the honeybee algorithm. However, 

“here is a fundamental difference from other bee algorithms. That is, VBA 
das a broadcasting ability of the current best. The current best location is 
own’ to every bee so that this algorithm is more efficient. In this way, 
yrager bees do not have to come back to the hive to tell other onlooker 
bees via ‘waggle dance’ so as to save time. Similar broadcasting ability is 
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also used in the particle swarm optimization, especially in the accelerated 
PSO algorithms. 

For a mixed type of problem when the decision variables can take both 
discrete and continuous values, the encoding of the objective function into 
nectar should be carefully implemented, so it can represent the objective 
effectively. This is still an active area of current research. 


7.2.5 Artificial Bee Colony Optimization 


The artificial bee colony (ABC) optimization algorithm was first developed 
by D. Karaboga in 2005. Since then, Karaboga and Basturk and their col- 
leagues have systematically studied the performance of the ABC algorithm 
concerning unstrained optimization problems and its extension. 

In the ABC algorithm, the bees in a colony are divided into three groups: 
employed bees (forager bees), onlooker bees (observer bees) and scouts. For 
each food source, there is only one employed bee. That is to say, the number 
of employed bees is equal to the number of food sources. The employed bee 
of an discarded food site is forced to become a scout for searching new food 
sources randomly. Employed bees share information with the onlooker bees 
in a hive so that onlooker bees can choose a food source to forage. Unlike 
the honey bee algorithm which has two groups of the bees (forager bees 
and observer bees), bees in ABC are more specialized. 

For a given objective function f(x), it can be encoded as F(a) to rep- 
resent the amount of nectar at location x, thus the probability P; of an 
onlooker bee choose to go to the preferred feud source at 2; can be defined 
by P; = F(xi)/03_ , F(a;) where S is the number of food sources. At a 
particular food source, the intake efficiency is determined by F/T’ where 
F is the amount of nectar and T is the time spent at the food source. If 
a food source is tried/foraged at a given number of explorations without 
improvement, then it is abandoned, and the bee at this location will move 
on randomly to explore new locations. 

Various applications have been carried out recently in the last few years, 
including the combinatorial optimization, job scheduling, web-hosting allo- 
cation, and engineering design optimization. For continuous decision vari- 
ables, their performance is still under active research. The handling of the 

onerandheromone requires substantial amount of memory and computing time. 
x »possible to eliminate the pheromone and just use the roaming agents? 
=. ‘answer is yes. Particle swarm optimization is just the right kind of 
nithm for such further modifications which will be discussed later in 
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Chapter 8 


SWARM OPTIMIZATION 





Particle swarm optimization (PSO) was developed by Kennedy and Eber- 
hart in 1995, based on the swarm behaviour such as fish and bird school- 
ing in nature. Since then, PSO has generated much wider interests, and 
forms an exciting, ever-expanding research subject, called swarm intelli- 
gence. PSO has been applied to almost every area in optimization, com- 
putational intelligence, and design/scheduling applications. There are at 
least two dozens of PSO variants, and hybrid algorithms by combining PSO 
with other existing algorithms are also increasingly popular. 


8.1 SWARM INTELLIGENCE 


Many algorithms such as ant colony algorithms and virtual ant algorithms 
use the behaviour of the so-called swarm intelligence. Particle swarm opti- 
mization may have some similarities with genetic algorithms and ant algo- 
rithms, but it is much simpler because it does not use mutation/crossover 
operators or pheromone. Instead, it uses the real-number randomness and 
the global communication among the swarm particles. In this sense, it is 
also easier to implement as there is no encoding or decoding of the param- 
eters into binary strings as those in genetic algorithms (which can also use 
real-number strings). 

This algorithm searches the space of an objective function by adjusting 
the trajectories of individual agents, called particles, as the piecewise paths 
formed by positional vectors in a quasi-stochastic manner. The movement 
of a swarming particle consists of two major components: a stochastic com- 
ponent and a deterministic component. Each particle is attracted toward 
the position of the current global best g* and its own best location x7 in 
history, while at the same time it has a tendency to move randomly. 

When a particle finds a location that is better than any previously found 
cations, then it updates it as the new current best for particle 7. There 
current best for all n particles at any time ¢ during iterations. The 
is to find the global best among all the current best solutions until the 
tive no longer improves or after a certain number of iterations. The 
~vement of particles is schematically represented in Fig. 8.1 where x is 
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possible i 
/ directions 0 @; 


particle i Ne 


Figure 8.1: Schematic representation of the motion of a particle 
in PSO, moving towards the global best g* and the current best 
x; for each particle 7. 


@ 


Particle Swarm Optimization 





Objective function f(x), x = (a1,...,%p)" 
Initialize locations x; and velocity v; of n particles. 
Find g* from min{f(a1),..., f(@n)} (at t = 0) 
while ( criterion ) 
t=t+1 (pseudo time or iteration counter) 
for loop over all n particles and all d dimensions 
Generate new velocity vit! using equation (8.1) 
Calculate new locations xt! = a! + v} 
Evaluate objective functions at new locations x 
Find the current best for each particle x; 
end for 
Fin the current global best g* 
end while 
Output the final results «* and g* 


t+1 


a 





Figure 8.2: Pseudo code of particle swarm optimization. 


the current best for particle i, and g* ~ min{f(a;)} for (@ = 1,2,...,n) is 
the current global best. 


8.2 PSO ALGORITHMS 


icine The essential steps of the particle swarm optimization can be summarized 
““¢ “as the pseudo code shown in Fig. 8.2. 





vit! = vf +06 0 [g* — a!) + Ber 0 [ej — a), (8.1) 


a 
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where €, and €9 are two random vectors, and each entry taking the values 
between 0 and 1. The Hadamard product of two matrices u © v is defined 
as the entrywise product, that is [wu © v]i; = uijvij- The parameters a and 
6 are the learning parameters or acceleration constants, which can typically 
be taken as, say, a © 3 & 2. 

The initial locations of all particles should distribute relatively uniformly 
so that they can sample over most regions, which is especially important 
for multimodal problems. The initial velocity of a particle can be taken as 
zero, that is, v{=°= 0. The new position can then be updated by 


eth = git ee (8.2) 


av; 


Although v; can be any values, it is usually bounded in some range [0, Umax]. 


8.3. ACCELERATED PSO 


There are many variants which extend the standard PSO algorithm, and 
the most noticeable improvement is probably to use inertia function 6(t) 
so that vi is replaced by 6(t)v! 


vi? = Ov; + ae: © [g* — wi] + Geo © [a} — aj], (8.3) 


where @ takes the values between 0 and 1. In the simplest case, the inertia 
function can be taken as a constant, typically 6 = 0.5 ~ 0.9. This is 
equivalent to introducing a virtual mass to stabilize the motion of the 
particles, and thus the algorithm is expected to converge more quickly. 

The standard particle swarm optimization uses both the current global 
best g* and the individual best 27. The reason of using the individual best 
is primarily to increase the diversity in the quality solutions, however, this 
diversity can be simulated using some randomness. Subsequently, there is 
no compelling reason for using the individual best, unless the optimization 
problem of interest is highly nonlinear and multimodal. 

A simplified version which could accelerate the convergence of the al- 
gorithm is to use the global best only. Thus, in the accelerated particle 
swarm optimization, the velocity vector is generated by a simpler formula 


vit) =v; + a(e — 1/2) + B(g* — 23), (8.4) 
‘ely out of convenience. We can also use a standard normal distribution 


, where €,, is drawn from N(0, 1) to replace the second term. The update 
position is simply 


et ae er, (8.5) 
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In order to increase the convergence even further, we can also write the 
update of the location in a single step 


att! — (1— B)axt + Bg* + a€n. (8.6) 


This simpler version will give the same order of convergence. The typical 
values for this accelerated PSO area + 0.1 ~ 0.4 and @ = 0.1 ~ 0.7, though 
a = 0.2 and @ = 0.5 can be taken as the initial values for most unimodal 
objective functions. It is worth pointing out that the parameters a and 3 
should in general be related to the scales of the independent variables x; 
and the search domain. 

A further improvement to the accelerated PSO is to reduce the random- 
ness as iterations proceed. This means that we can use a monotonically 
decreasing function such as 


a=age™, (8.7) 


or 
a=apoy’, (0<7 <1), (8.8) 


where ag ¥ 0.5 ~ 1 is the initial value of the randomness parameter. Here t 
is the number of iterations or time steps. 0 < y < 1 is a control parameter. 
For example, in our implementation, we will use 


a = 0.74, (8.9) 


where t € [0,10]. Obviously, these parameters are fine-tuned to suit the 
current optimization problems as a demonstration. 


8.4 IMPLEMENTATION 


The accelerated particle swarm optimization has been implemented using 
both Matlab and Octave, and a simple program is provided below. This 
program can find the global optimal solution of most nonlinear functions 
in less a minute on most modern personal computers. 

Now let us look at the 2D Michalewicz function 


f(x,y) = —{ sin(a)[sin(—)P" + sin(y) [sin yP"}, 


where m = 10. The stationary conditions f; = fy = 0 require that 


(Metaher,. 4 2 : 
y ns ——" rsin(z) cos(—) — cos(z) sin(—) = 0, 
7 wT os 


at 


x 








8 2y? 2y” 
lls sin(2) cos(—) — cos(y) sin(—”-) = 0. 
TT WT TT 
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Figure 8.3: Michalewicz’s function with a global minimum at about 
(2.20319, 1.57049). 


The solution at (0,0) is trivial, and the minimum f* + —1.801 occurs at 
about (2.20319,1.57049) (see Fig. 8.3). 


% The Particle Swarm Optimization 

% (written by X S Yang, Cambridge University) 

/% Usage: pso(number_of_particles,Num_iterations) 

h eg: best=pso_demo(20,10) ; 

% where best=[xbest ybest zbest] Zan n by 3 matrix 
%  xbest(i)/ybest(i) are the best at ith iteration 


function [best]=pso_simpledemo(n,Num_iterations) 

% n=number of particles 

%, Num_iterations=total number of iterations 

if nargin<2, Num_iterations=10; end 

if nargin<i, n=20; end 

% Michalewicz Function f*=-1.801 at [2.20319,1.57049] 

% Splitting two parts to avoid long lines in printing 

stri=‘-sin(x)*(sin(x*2/3.14159))~20’; 

str2=‘-sin(y) *(sin(2*y*2/3.14159) )~20’; 

funstr=strcat(stri,str2); 

% Converting to an inline function and vectorization 
ee 







=[0 40 4]; 


setting the Par omehenes aries bets 
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% alpha=gamma“t=0.7°t; 

% Speed of convergence (0->1)=(slow->fast) 

beta=0.5; 

ee a ee 
% Grid values of the objective function 

% These values are used for visualization only 
Ngrid=100; 

dx= (range (2)-range(1))/Ngrid; 

dy= (range (4) -range (3) )/Ngrid; 
xgrid=range(1):dx:range(2); ygrid=range(3) :dy:range(A4) ; 
(x, y]=meshgrid(xgrid,ygrid) ; 

z=f (x,y); 

% Display the shape of the function to be optimized 
figure(1); 

surfc(x,y,z); 


best=zeros(Num_iterations,3); % initialize history 

ho 7-777 Start Particle Swarm Optimization ----------- 

% generating the initial locations of n particles 

(xn, yn]=init_pso(n, range) ; 

% Display the paths of particles in a figure 

% with a contour of the objective function 
figure(2) ; 

% Start iterations 

for i=1:Num_iterations, 

% Show the contour of the function 
contour(x,y,z,15); hold on; 

% Find the current best location (xo,yo) 

zn=f (xn, yn) ; 

zn_min=min(zn) ; 

xo=min(xn(zn==zn_min)) ; 

yo=min(yn(zn==zn_min)) ; 

zo=min(zn(zn==zn_min)) ; 

% Trace the paths of all roaming particles 

% Display these roaming particles 

plot(xn,yn,‘.’,xo,yo,‘*’); axis(range) ; 

% The accelerated PSO with alpha=gamma“t 
gamma=0.7; alpha=gamma.~i; 

% Move all the particles to new locations 


(xn, yn]=pso_move(xn,yn,xo,yo,alpha, beta,range) ; 
drawnow; 

% Use "hold on" to display paths of particles 
hold off; 
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Figure 8.4: Initial and final locations of 20 particles after 10 iterations. 


function [xn,yn]=init_pso(n,range) 
xrange=range(2)-range(1); yrange=range(4)-range(3) ; 
xn=rand(1,n)*xrangetrange(1) ; 
yn=rand(1,n)*yrangetrange (3) ; 
% Move all the particles toward (xo,yo) 
function [xn,yn]=pso_move(xn,yn,xo,yo,a,b,range) 
nn=size(yn,2); %a=alpha, b=beta 
xn=xn.*(1-b)+xo.*bta.*(rand(1,nn)-0.5); 
yn=yn.*(1-b)+yo.*bta.*(rand(1,nn)-0.5); 
(xn, yn] =findrange(xn,yn, range) ; 
4 Make sure the particles are within the range 
function [xn,yn]=findrange(xn,yn,range) 
nn=length(yn) ; 
for i=i:nn, 

if xn(i)<=range(1), xn(i)=range(1); end 

if xn(i)>=range(2), xn(i)=range(2); end 

if yn(i)<=range(3), yn(i)=range(3); end 

if yn(i)>=range(4), yn(i)=range(4); end 
end 


If we run the program, we will get the global optimum after about 200 


evaluations of the objective function (for 20 particles and 10 iterations). 
The results and the locations of the particles are shown in Fig. 8.4. 


8.5 CONVERGENCE ANALYSIS 








‘9m the statistical point of view, each particle in PSO forms a Markov 

ia, though this Markov chain is biased towards to the current best, as 
bikibion probability often leads to the acceptance of the move towards 
current global best. In addition, the multiple Markov chains are inter- 
ching j in terms of partly deterministic attraction movement. Therefore, the 
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mathematical analysis concerning of the rate of convergence of PSO is very 
difficult, if not impossible. There is no doubt that any theoretical advance 
in understanding multiple interacting Markov chains will gain tremendous 
insightful in understanding how the PSO behaves and may consequently 
lead to the design of better or new PSO algorithms. 

Mathematically, if we ignore the random factors, we can view the system 
formed by (8.1) and (8.2) as a dynamical system. If we focus on a single 
particle 7 and image there is only one particle in this system, then the 
global best g* is the same as its current best x7. In this case, we have 

vytsutyg*—2), y=at8B, (8.10) 
and 
pitt — ot + yitt, (8.11) 
Following an analysis of 1D dynamical system for particle swarm opti- 
mization by Clerc and Kennedy (2002), we can replace g* by a parameter 
constant p so that we can see if or not the particle of interest will converge 
towards p. Now we can write the above system as a simple dynamical 
system 


u(t + 1) = v(t) + y(p — x(t), x(t+1)=a(t)+v(t+1). (8.12) 


For simplicity, we only focus on a single particle. By setting u, = p—a(t+1) 
and using the notations for dynamical systems, we have 


Vt+1 = Ve + YUE; Ut+1 = —Ue + (1 a vue, (8.13) 


1 v 


The general solution of this dynamical system can be written as 


or 


Y; = Yo exp[At]. (8.15) 
The main behaviour of this system can be characterized by the eigenvalues 


rA of A 
Jy—4 
pene ase ee a (8.16) 








2 2 


It can be seen clearly that ~y = 4 leads to a bifurcation. 
Following a straightforward analysis of this dynamical system, we can 
iM teur Have three cases. For 0 < y < 4, cyclic and/or quasi-cyclic trajectories ex- 
. In this case, when randomness is gradually reduced, some convergence 
a be observed. For y > 4, non-cyclic behaviour can be expected and 
je distance from Y; to the center (0,0) is monotonically increasing with 
oO. Ina special case y = 4, some convergence behaviour can be observed. 


Ku 


& 
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For detailed analysis, please refer to Clerc and Kennedy (2002). Since p is 
linked with the global best, as the iterations continue, it can be expected 
that all particles will aggregate towards the the global best. 

Various studies show that PSO algorithms can outperform genetic al- 
gorithms and other conventional algorithms for solving many optimization 
problems. This is partially due to that fact that the broadcasting ability of 
the current best estimates gives a better and quicker convergence towards 
the optimality. However, PSO algorithms are almost memoryless since they 
do not record the movement paths of each particle, and it is expected that 
it can be further improved using short-term memory in the similar fashion 
as that in Tabu search. Further development is under active research. 


REFERENCES 


1. A. Chatterjee and P. Siarry, Nonlinear inertia variation for dynamic adap- 
tation in particle swarm optimization, Comp. Oper. Research, 33, 859-871 
(2006). 


2. M. Clerc, J. Kennedy, The particle swarm - explosion, stability, and con- 
vergence in a multidimensional complex space, JEEE Trans. Evolutionary 
Computation, 6, 58-73 (2002). 


3. A. P. Engelbrecht, Fundamentals of Computational Swarm Intelligence, Wi- 
ley, 2005. 


4. J. Kennedy and R. C. Eberhart, Particle swarm optimization, in: Proc. of 
IEEE International Conference on Neural Networks, Piscataway, NJ. pp. 
1942-1948 (1995). 


5. J. Kennedy, R. C. Eberhart, Swarm intelligence, Academic Press, 2001. 


6. Swarm intelligence, http://www.swarmintelligence.org 








Chapter 9 


HARMONY SEARCH 





9.1 HARMONICS AND FREQUENCIES 


Harmony Search (HS) is a relatively new heuristic optimization algorithm 
and it was first developed by Z. W. Geem et al. in 2001. Since then, it 
has been applied to solve many optimization problems including function 
optimization, water distribution network, groundwater modelling, energy- 
saving dispatch, structural design, vehicle routing, and others. The possi- 
bility of combining harmony search with other algorithms such as particle 
swarm optimization and genetic algorithms has also been investigated. 

Harmony search is a music-inspired metaheuristic optimization algo- 
rithm. It is inspired by the observation that the aim of music is to search 
for a perfect state of harmony. This harmony in music is analogous to find 
the optimality in an optimization process. The search process in optimiza- 
tion can be compared to a musician’s improvisation process. This perfectly 
pleasing harmony is determined by the audio aesthetic standard. 

The aesthetic quality of a musical instrument is essentially determined 
by its pitch (or frequency), timbre (or sound quality), and amplitude (or 
loudness). Timbre is largely determined by the harmonic content which is 
in turn determined by the waveforms or modulations of the sound signal. 
However, the harmonics it can generate will largely depend on the pitch or 
frequency range of the particular instrument. 

Different notes have different frequencies. For example, the note A above 
middle C (or standard concert A4) has a fundamental frequency of fo = 440 
Hz. As the speed of sound in dry air is about v = 331 + 0.6T m/s where 
T is the temperature in degrees Celsius near 0°C. So at room temperature 
T = 20°C, the A4 note has a wavelength A = v/fo + 0.7795 m. When we 
adjust the pitch, we are in fact trying to change the frequency. In music 
theory, pitch p in MIDI is often represented as a numerical scale (a linear 

go metaapitch space) using the following formula 


at 








& 
fa 
= 694 121 9.1 
p= 69 + 12log,( 7), (9.1) 
f = 440 x 2(7-99)/12, (9.2) 
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Figure 9.1: Harmony of two notes with a frequency ratio of 2:3 
and their waveform. 


which means that the A4 notes has a pitch number 69. In this scale, octaves 
correspond to size 12 and semitone corresponds to size 1. Furthermore, the 
ratio of frequencies of two notes which are an octave apart is 2:1. Thus, 
the frequency of a note is doubled (halved) when it raised (lowered) by 
an octave. For example, A2 has a frequency of 110Hz, while A5 has a 
frequency of 880Hz. 

The measurement of harmony when different pitches occurring simulta- 
neously, like any aesthetic quality, is somewhat subjective. However, it is 
possible to use some standard estimation for harmony. The frequency ra- 
tio, pioneered by ancient Greek mathematician Pythagoras, is a good way 
for such estimations. For example, the octave with a ratio of 1:2 sounds 
pleasant when playing together, so are the notes with a ratio of 2:3 (see 
Fig. 9.1). However, it is unlikely for any random notes such as those shown 
in 9.2 to produce a pleasant harmony. 


9.2 HARMONY SEARCH 


Harmony search can be explained in more detail with the aid of the dis- 
cussion of the improvisation process by a musician. When a musician is 
improvising, he or she has three possible choices: (1) play any famous 
piece of music (a series of pitches in harmony) exactly from his or her 
memory; (2) play something similar to a known piece (thus adjusting the 
pitch slightly); or (3) compose new or random notes. If we formalize these 
three options for optimization, we have three corresponding components: 
usage of harmony memory, pitch adjusting, and randomization. 

The usage of harmony memory is important as it is similar to choose 
~dhe best fit individuals in the genetic algorithms. This will ensure the best 
jarmonies will be carried over to the new harmony memory. In order to 
se this memory more effectively, we can assign a parameter Taccept € 0, 1], 
ulled harmony memory accepting or considering rate. If this rate is too 
wlow, only few best harmonies are selected and it may converge too slowly. 
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Figure 9.2: Random music notes. 


If this rate is extremely high (near 1), almost all the harmonies are used in 
the harmony memory, then other harmonies are not explored well, leading 
to potentially wrong solutions. Therefore, typically, raccept = 0.7 ~ 0.95. 

To adjust the pitch slightly in the second component, we have to use a 
method such that it can adjust the frequency efficiently. In theory, the pitch 
can be adjusted linearly or nonlinearly, but in practice, linear adjustment 
is used. If %oiq is the current solution (or pitch), then the new solution 
(pitch) apew is generated by 


Znew = Lold + bp (2 rand — 1), (9.3) 


where rand is a random number drawn from a uniform distribution [0, 1]. 
Here by is the bandwidth, which controls the local range of pitch adjust- 
ment. In fact, we can see that the pitch adjustment (9.3) is a random walk. 


Pitch adjustment is similar to the mutation operator in genetic algo- 
rithms. We can assign a pitch-adjusting rate (rp) to control the degree of 
the adjustment. If rp, is too low, then there is rarely any change. If it is 
too high, then the algorithm may not converge at all. Thus, we usually use 
Tpa = 0.1 ~ 0.5 in most simulations. 

The third component is the randomization, which is to increase the 
diversity of the solutions. Although adjusting pitch has a similar role, but 
it is limited to certain local pitch adjustment and thus corresponds to a 
local search. The use of randomization can drive the system further to 
explore various regions with high solution diversity so as to find the global 
optimality. So we have 


Pa = Plowerlimit =f Prange * rand, (9.4) 


where Prange = Pupperlimit — Plowerlimit. Here rand is a random number 
generator in the range of 0 and 1. 
The three components in harmony search can be summarized as the 
pseudo code shown in Fig. 9.3 where we can see that the probability of 
wolidene randomization is 


af 


x 





Prandom = 1 — Taccept> (9.5) 


Pitch = Taccept * Tpa: (9.6) 
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Harmony Search 





Objective function f(x), w= (21,...,0p)" 
Generate initial harmonics (real number arrays) 
Define pitch adjusting rate (Tpa) and pitch limits 
Define harmony memory accepting rate (Taccept) 
while (t <Max number of iterations ) 
Generate new harmonics by accepting best harmonics 
Adjust pitch to get new harmonics (solutions) 
if (rand> Taccept); 
choose an existing harmonic randomly 
else if (rand> rpa), 
adjust the pitch randomly within a bandwidth (9.3) 
else 
generate new harmonics via randomization (9.4) 
end if 
Accept the new harmonics (solutions) if better 
end while 
Find the current best estimates 





Figure 9.3: Pseudo code of Harmony Search. 


Furthermore, like genetic algorithms and particle swarm optimization, har- 
mony search is not a gradient-based search, so it avoids most of the pitfalls 
of any gradient-based search algorithms. Thus, it has fewer mathematical 
requirements, and subsequently, it can be used to deal with complex ob- 
jective functions whether continuous or discontinuous, linear or nonlinear, 
or stochastic with noise. 

On the other hand, harmony search could be potentially more efficient 
than genetic algorithms because harmony search does not use binary en- 
coding and decoding, but it does have multiple solution vectors. Therefore, 
HS can be faster during each iteration. The implementation of HS algo- 
rithm is also easier. In addition, there is evidence to suggest that HS is 
less sensitive to the chosen parameters, which means that we do not have 
to fine-tune these parameters to get quality solutions. 


9.3. IMPLEMENTATION 


oF sing the three components described in above section, we can implement 
ve harmony search algorithm in Matlab/Octave. 
=For Rosenbrock’s banana function 


f(z,y) = (1—2)? + 100(y — 2”), (9.7) 
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Figure 9.4: The variations of harmonies in harmony search. 


within the domain 
(x,y) € [—10, 10] x [—10, 10], (9.8) 


it has the global minimum fin = 0 at (1,1). The following Matlab/Octave 
program can be used to find its optimality. 


% Harmony Search (Simple Demo) Matlab Program 
/% Written by X S Yang (Cambridge University) 
/% Usage: hs_simple 


% or hs_simple(‘x*2+(y-5) 2’ ,25000) ; 
function [solution,fbest]=hs_simple(funstr,MaxAttempt) 
disp(‘It may take a few minutes ...’); 


% MaxAttempt=25000; % Max number of Attempt 
if nargin<2, MaxAttempt=25000; end 

if nargin<i, 

% Rosenbrock’s Banana function with the 

% global fmin=0 at (1,1). 

funstr = ‘(1-x1)72+100* (x2-x172)*2’; 

end 

% Converting to an inline function 
f=vectorize(inline(funstr)) ; 

ndim=2; ‘Number of independent variables 
she range of the objective function 
:)=[-10 101; range(2, Je [-10 10]; 


ange=[200 200]; 
Initial parameter setting 
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HS_size=20; %Length of solution vector 
HMacceptRate=0.95; %HM Accepting Rate 
PArate=0.7; “Pitch Adjusting rate 


% Generating Initial Solution Vector 
for i=1:HS_size, 
for j=1:ndim, 
x(j)=range(j,1)+(range(j,2)-range(j,1))*rand; 
end 
HM(i, :) = x; 
HMbest (i) = f(x(1), x(2)); 
end %/ for i 
% Starting the Harmony Search 
for count = 1:MaxAttempt, 
for j = i:ndim, 
if (rand >= HMacceptRate) 
% New Search via Randomization 
x(j)=range(j,1)+(range(j,2)-range(j,1))*rand; 
else 
% Harmony Memory Accepting Rate 
x(j) = HM(fix(HS_size*rand)+1,j); 
if (rand <= PArate) 
% Pitch Adjusting in a given range 
pa=(range(j,2)-range(j,1))/pa_range(j); 
x(j)= x(j)+pa*(rand-0.5); 
end 
end 
end %/ for j 
% Evaluate the new solution 
fbest = f(x(1), x(2)); 
% Find the best in the HS solution vector 
HSmaxNum = 1; HSminNum=1; 
HSmax = HMbest(1); HSmin=HMbest (1) ; 
for i = 2:HS_size, 
if HMbest(i) > HSmax, 
HSmaxNum = i; 
HSmax = HMbest(i); 
end 
if HMbest(i)<HSmin, 
HSminNum=i ; 
HSmin=HMbest (i) ; 
end 


% Updating the current solution if better 
if fbest < HSmax, 
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HM(HSmaxNum, :) = x; 
HMbest (HSmaxNum) = fbest; 
end 
solution=x; % Record the solution 
end %/ (end of harmony search) 


The best estimate solution (1.005, 1.0605) is obtained after 25000 itera- 
tions. On a modern desktop computer, it usually takes less than a minute. 
The variations of these solutions are shown in Fig. 9.4. 

We have used Taccept =HMacceptRate= 0.95, and the pitch adjusting 
rate Tpa =PArate= 0.7. From Fig. 9.4, we can see that since the pitch 
adjustment is more intensive in local regions (two thin strips), it indeed 
indicates that the harmony search is more efficient than genetic algorithms. 
However, such comparison for different types of problem is still an area of 
active research. 

Harmony search is emerging as a powerful algorithm, and its relevant 
literature is expanding. It is still an interesting area of active research. 
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Chapter 10 


FIREFLY ALGORITHM 





10.1 BEHAVIOUR OF FIREFLIES 


The flashing light of fireflies is an amazing sight in the summer sky in 
the tropical and temperate regions. There are about two thousand firefly 
species, and most fireflies produce short and rhythmic flashes. The pat- 
tern of flashes is often unique for a particular species. The flashing light is 
produced by a process of bioluminescence, and the true functions of such 
signaling systems are still being debated. However, two fundamental func- 
tions of such flashes are to attract mating partners (communication), and 
to attract potential prey. In addition, flashing may also serve as a protec- 
tive warning mechanism to remind potential predators of the bitter taste 
of fireflies. 

The rhythmic flash, the rate of flashing and the amount of time form 
part of the signal system that brings both sexes together. Females respond 
to a male’s unique pattern of flashing in the same species, while in some 
species such as Photuris, female fireflies can eavesdrop on the biolumines- 
cent courtship signals and even mimic the mating flashing pattern of other 
species so as to lure and eat the male fireflies who may mistake the flashes 
as a potential suitable mate. Some tropic fireflies can even synchronize 
their flashes, thus forming emerging biological self-organized behavior. 

We know that the light intensity at a particular distance r from the light 
source obeys the inverse square law. That is to say, the light intensity I 
decreases as the distance r increases in terms of J x 1/r?. Furthermore, 
the air absorbs light which becomes weaker and weaker as the distance 
increases. These two combined factors make most fireflies visual to a limit 
distance, usually several hundred meters at night, which is good enough 
for fireflies to communicate. 

The flashing light can be formulated in such a way that it is associated 
os evarangith the objective function to be optimized, which makes it possible to 
rf aulate new optimization algorithms. In the rest of this chapter, we will 
“outline the basic formulation of the Firefly Algorithm (FA) and then 
iss the implementation in detail. 
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Firefly Algorithm 





Objective function f(x), = (Pin cata) 
Generate initial population of fireflies x; (i = 1,2,...,n) 
Light intensity I; at x; is determined by f(x;) 
Define light absorption coefficient 
while (t <MaxGeneration) 
for i=1:n all n fireflies 
for j =1:n all n fireflies (inner loop) 
if (I; < 1;), Move firefly i towards j; end if 
Vary attractiveness with distance r via exp|—yr] 
Evaluate new solutions and update light intensity 
end for j 
end for 1 
Rank the fireflies and find the current global best g,. 
end while 
Postprocess results and visualization 





Figure 10.1: Pseudo code of the firefly algorithm (FA). 


10.2. FIREFLY ALGORITHM 


Now we can idealize some of the flashing characteristics of fireflies so as 
to develop firefly-inspired algorithms. For simplicity in describing our new 
Firefly Algorithm (FA) which was developed by Xin-She Yang at Cam- 
bridge University in 2007, we now use the following three idealized rules: 


e All fireflies are unisex so that one firefly will be attracted to other 
fireflies regardless of their sex; 


e Attractiveness is proportional to the their brightness, thus for any two 
flashing fireflies, the less brighter one will move towards the brighter 
one. The attractiveness is proportional to the brightness and they 
both decrease as their distance increases. If there is no brighter one 
than a particular firefly, it will move randomly; 


e The brightness of a firefly is affected or determined by the landscape 
of the objective function. 






For a maximization problem, the brightness can simply be proportional 
‘@ the value of the objective function. Other forms of brightness can be 
@fined in a similar way to the fitness function in genetic algorithms. 
> Based on these three rules, the basic steps of the firefly algorithm (FA) 
SS . é : 

“can be summarized as the pseudo code shown in Figure 11.1. 
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10.3. LIGHT INTENSITY AND ATTRACTIVENESS 


In the firefly algorithm, there are two important issues: the variation of 
light intensity and formulation of the attractiveness. For simplicity, we 
can always assume that the attractiveness of a firefly is determined by its 
brightness which in turn is associated with the encoded objective function. 
In the simplest case for maximum optimization problems, the brightness 
I of a firefly at a particular location a can be chosen as I(x) « f(x). 
However, the attractiveness @ is relative, it should be seen in the eyes 
of the beholder or judged by the other fireflies. Thus, it will vary with 
the distance r;; between firefly 7 and firefly j. In addition, light intensity 
decreases with the distance from its source, and light is also absorbed in 
the media, so we should allow the attractiveness to vary with the degree of 
absorption. 
In the simplest form, the light intensity I(r) varies according to the 
inverse square law 
I(r) = a (10.1) 
r 
where J, is the intensity at the source. For a given medium with a fixed 
light absorption coefficient y, the light intensity I varies with the distance 
r. That is 
L=Ihe™, (10.2) 
where Jg is the original light intensity. In order to avoid the singularity 
at r = 0 in the expression I,/r?, the combined effect of both the inverse 
square law and absorption can be approximated as the following Gaussian 
form : 
iirjeaibe (10.3) 
As a firefly’s attractiveness is proportional to the light intensity seen by 
adjacent fireflies, we can now define the attractiveness @ of a firefly by 


B= Boe™, (10.4) 


where { is the attractiveness at r = 0. As it is often faster to calculate 
1/(1 +r) than an exponential function, the above function, if necessary, 
can conveniently be approximated as 


p=. 

yr 

Both (10.4) and (10.5) define a characteristic distance ! = 1/,/7 over which 

the attractiveness changes significantly from J to Gye + for equation (10.4) 
i Bo /2 for equation (10.5). 

f@ the actual implementation, the attractiveness function 3(r) can be 

monotonically decreasing functions such as the following generalized 





(10.5) 








B(r) = Boe", = (m>1). (10.6) 
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For a fixed ¥, the characteristic length becomes 
faq = 1, m — oo. (10.7) 
Conversely, for a given length scale [ in an optimization problem, the 
parameter y can be used as a typical initial value. That is 
1 
Y= Tm (10.8) 
The distance between any two fireflies i and j at x; and a,, respectively, 
is the Cartesian distance 





d 


rig = lls — al = | Do (ein — 25,6)? (10.9) 
k=1 


where x;,, is the kth component of the spatial coordinate x; of ith firefly. 
In 2-D case, we have 





The movement of a firefly 7 is attracted to another more attractive 
(brighter) firefly 7 is determined by 


eC; 23+ Boe 7H (a; = xi) +Q &, (10.11) 


where the second term is due to the attraction. The third term is random- 
ization with a being the randomization parameter, and e€; is a vector of 
random numbers drawn from a Gaussian distribution or uniform distribu- 
tion. For example, the simplest form is €; can be replaced by rand — 1/2 
where rand is a random number generator uniformly distributed in [0, 1]. 
For most our implementation, we can take 89 = 1 and a € [0,1]. 

It is worth pointing out that (10.11) is a random walk biased towards the 
brighter fireflies. If 69 = 0, it becomes a simple random walk. Furthermore, 
the randomization term can easily be extended to other distributions such 
as Lévy flights. 

The parameter y now characterizes the variation of the attractiveness, 
and its value is crucially important in determining the speed of the con- 
vergence and how the FA algorithm behaves. In theory, y € [0,00), but 
in practice, y = O(1) is determined by the characteristic length I of the 
system to be optimized. Thus, for most applications, it typically varies 
from 0.1 to 10. 









0.4 SCALINGS AND ASYMPTOTICS 
I is worth pointing out that the distance r defined above is not limited to 
the Euclidean distance. We can define other distance r in the n-dimensional 
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hyperspace, depending on the type of problem of our interest. For example, 
for job scheduling problems, r can be defined as the time lag or time in- 
terval. For complicated networks such as the Internet and social networks, 
the distance r can be defined as the combination of the degree of local 
clustering and the average proximity of vertices. In fact, any measure that 
can effectively characterize the quantities of interest in the optimization 
problem can be used as the ‘distance’ r. 

The typical scale I should be associated with the scale concerned in 
our optimization problem. If T is the typical scale for a given optimization 
problem, for a very large number of fireflies n >> k where k is the number of 
local optima, then the initial locations of these n fireflies should distribute 
relatively uniformly over the entire search space. As the iterations proceed, 
the fireflies would converge into all the local optima (including the global 
ones). By comparing the best solutions among all these optima, the global 
optima can easily be achieved. Our recent research suggests that it is 
possible to prove that the firefly algorithm will approach global optima 
when n — oo andt > 1. In reality, it converges very quickly and this will 
be demonstrated later in this chapter. 

There are two important limiting or asymptotic cases when y — 0 and 
y — coo. For y — 0, the attractiveness is constant @ = 69 and [T — oo, 
this is equivalent to saying that the light intensity does not decrease in an 
idealized sky. Thus, a flashing firefly can be seen anywhere in the domain. 
Thus, a single (usually global) optima can easily be reached. If we remove 
the inner loop for j in Figure 11.1 and replace x; by the current global 
best g,, then the Firefly Algorithm becomes the special case of accelerated 
particle swarm optimization (PSO) discussed earlier. Subsequently, the 
efficiency of this special case is the same as that of PSO. 

On the other hand, the limiting case y — 00 leads to T > 0 and G(r) > 
d(r) which is the Dirac delta function, which means that the attractiveness 
is almost zero in the sight of other fireflies. This is equivalent to the case 
where the fireflies roam in a very thick foggy region randomly. No other 
fireflies can be seen, and each firefly roams in a completely random way. 
Therefore, this corresponds to the completely random search method. 

As the firefly algorithm is usually in the case between these two extremes, 
it is possible to adjust the parameter y and a so that it can outperform 
both the random search and PSO. In fact, FA can find the global optima 
as well as the local optima simultaneously and effectively. This advantage 
will be demonstrated in detail later in the implementation. 

A further advantage of FA is that different fireflies will work almost 

r aictesdpdependently, it is thus particular suitable for parallel implementation. It 
x yen better than genetic algorithms and PSO because fireflies aggregate 
; é closely around each optimum. It can be expected that the interactions 
en different subregions are minimal in parallel implementation. 
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Figure 10.2: Landscape of a function with two equal global maxima. 


10.5 IMPLEMENTATION 


In order to demonstrate how the firefly algorithm works, we have imple- 
mented it in Matlab/Octave to be given later. 

In order to show that both the global optima and local optima can be 
found simultaneously, we now use the following four-peak function 





f(x,y) =e @-49- 4 4 gp 44)? —0-4)? 4 fe 27-0? 4 gut 4), 
where (x,y) € [—5,5] x [—5,5]. This function has four peaks. Two local 
peaks with f = 1 at (—4,4) and (4,4), and two global peaks with fax = 2 
at (0,0) and (0,—4), as shown in Figure 10.2. We can see that all these 
four optima can be found using 25 fireflies in about 20 generations (see Fig. 
10.3). So the total number of function evaluations is about 500. This is 
much more efficient than most of existing metaheuristic algorithms. 


% Firefly Algorithm by X S Yang (Cambridge University) 
% Usage: ffa_demo([number_of_fireflies,MaxGeneration] ) 
h eg: ffa_demo([12,50]); 

function [best]=firefly_simple (instr) 

y% n=number of fireflies 

% MaxGeneration=number of pseudo time steps 

if nargin<1, instr=[12 50]; end 

n=instr(1); MaxGeneration=instr (2) ; 

rand(‘state’,0); % Reset the random generator 

iy, te soon Four peak functions --------------------- 


% 


Stri=‘exp(-(x-4) *2-(y-4) 72) +exp(-(x+4) 72-(y-4) 72); 





10.5 IMPLEMENTATION 


% vange=([xmin xmax ymin ymax] ; 
range=[-5 5 -5 5]; 


¥ ------------------------------------------------ 
alpha=0.2; % Randomness 0--1 (highly random) 
gamma=1.0; % Absorption coefficient 

¥ ------------------------------------------------ 
% Grid values are used for display only 

Ngrid=100; 


dx= (range (2)-range(1))/Ngrid; 

dy=(range (4) -range(3))/Ngrid; 

[x,y] =meshgrid(range(1):dx:range(2),... 
range (3) :dy:range(4)) ; 


z=f (x,y); 

% Display the shape of the objective function 
figure(1); surfc(x,y,zZ); 

¥ ------------------------------------------------ 


% generating the initial locations of n fireflies 
(xn, yn, Lightn]=init_ffa(n, range) ; 

% Display the paths of fireflies in a figure with 
% contours of the function to be optimized 


figure(2) ; 
% Iterations or pseudo time marching 
for i=1:MaxGeneration, hhhhh start iterations 


% Show the contours of the function 
contour(x,y,z,15); hold on; 

% Evaluate new solutions 

zn=f (xn, yn) ; 


% Ranking the fireflies by their light intensity 
([Lightn, Index]=sort (zn) ; 
xn=xn(Index); yn=yn(Index) ; 


xO=xn; yo=yn; Lighto=Lightn; 
% Trace the paths of all roaming fireflies 
plot(xn,yn,‘.’,‘markersize’ ,10, ‘markerfacecolor’,‘g’); 


% Move all fireflies to the better locations 
(xn, yn]=ffa_move(xn,yn,Lightn,xo,yo,... 
Lighto,alpha,gamma,range) ; 

drawnow; 

% Use "hold on" to show the paths of fireflies 
hold off; 

end %%U%h end of iterations 

:,1)=xo’; best(:,2)=yo’; best(:,3)=Lighto’ ; 


etion [xn, yn, Lightn]=init_ffa(n, range) 
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Figure 10.3: The initial locations of 25 fireflies (left) and their final locations 
after 20 iterations (right). 


xrange=range(2)-range(1) ; 
yrange=range (4) -range(3) ; 
xn=rand(1,n)*xrangetrange(1) ; 
yn=rand(1,n)*yrangetrange (3) ; 
Lightn=zeros(size(yn) ) ; 


% Move all fireflies toward brighter ones 

function [xn,yn]=ffa_move(xn,yn,Lightn,xo,yo,... 
Lighto,alpha, gamma, range) 

ni=size(yn,2); nj=size(yo,2); 

for i=1:ni, 

% The attractiveness parameter beta=exp(-gamma*r) 
for j=i:nj, 

r=sqrt ((xn(i)-xo(j))72+(yn(i)-yo(j))72); 

if Lightn(i)<Lighto(j), % Brighter and more attractive 

beta0=1; beta=beta0*exp(-gamma*tr.~2) ; 

xn(i)=xn(i).*(1-beta)+xo(j) .*betatalpha.*(rand-0.5) ; 

yn(i)=yn(i) .*(1-beta)+yo(j) .*betatalpha.*(rand-0.5) ; 

end 
end % end for j 

end % end for i 

(xn, yn]=findrange(xn,yn,range) ; 


% Make sure the fireflies are within the range 
function [xn,yn]=findrange(xn,yn,range) 

eb Matthey, for i=1:length(yn) , 

if xn(i)<=range(1), xn(i)=range(1); end 

3 if xn(i)>=range(2), xn(i)=range(2); end 

if yn(i)<=range(3), yn(i)=range(3); end 

© if yn(i)>=range(4), yn(i)=range(4); end 
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In the implementation, the values of the parameters are a = 0.2, y= 1 
and 39 = 1. Obviously, these parameters can be adjusted to suit for solving 
various problems with different scales. 


10.6 FA VARIANTS 


The basic firefly algorithm is very efficient, but we can see that the solutions 
are still changing as the optima are approaching. It is possible to improve 
the solution quality by reducing the randomness. 

A further improvement on the convergence of the algorithm is to vary 
the randomization parameter a so that it decreases gradually as the optima 
are approaching. For example, we can use 


OQ = Ago + (A — Aco)e*, (10.12) 


where t € [0,tmax] is the pseudo time for simulations and tax is the max- 
imum number of generations. ag is the initial randomization parameter 
while a. is the final value. We can also use a similar function to the 
geometrical annealing schedule. That is 


a= a6", (10.13) 


where @ € (0, 1] is the randomness reduction constant. 

In addition, in the current version of the FA algorithm, we do not ex- 
plicitly use the current global best g,, even though we only use it to decode 
the final best solutions. Our recent studies show that the efficiency may 
significantly improve if we add an extra term Ae;(x; — g,) to the updating 
formula (10.11). Here \ is a parameter similar to a and @, and e; is a 
vector of random numbers. These could form important topics for further 
research. 


10.7 SPRING DESIGN 


The design of a tension and compression spring is a well-known benchmark 
optimization problem. The main aim is to minimize the weight subject 
to constraints on deflection, stress, surge frequency and geometry. It in- 
volves three design variables: the wire diameter x1, coil diameter x2 and 
number/length of the coil 73. This problem can be summarized as 


gb Methane minimize f(x) = x{x2(2+ 23), (10.14) 





hd 


ct to the following constraints 
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Ax? — 24122 1 
= 1<0, 
so(®) = 3556632, 25) * bloaa? 1 
140.452 
eae 
93(x) T3203 =¥; 
ga(@) = oe at 20, (10.15) 


The simple bounds on the design variables are 
0.05 < 21 < 2.0, 0.25 < a < 1.3, 2.0 < x3 < 15.0. (10.16) 
The best solution found in the literature (e.g., Cagnina et al. 2008) is 
x, = (0.051690, 0.356750, 11.287126), (10.17) 


with the objective 
f(x.) = 0.012665. (10.18) 


We now provide the Matlab implementation of our firefly algorithm to- 
gether with the penalty method for incorporating constraints. You may 
need a newer version of Matlab to deal with function handles. If you run 
the program a few times, you can get the above optimal solutions. It is 
even possible to produce better results if you experiment the program for 


a while. 

eS So a SS se h 
% Firefly Algorithm for constrained optimization h 
% by Xin-She Yang (Cambridge University) Copyright @2009 % 
ea ana h 


function fa_mincon_demo 


% parameters [n N_iteration alpha betamin gamma] 
para=[40 150 0.5 0.2 1); 


% This demo uses the Firefly Algorithm to solve the 
% (Spring Design Problem as described by Cagnina et al., 
%, Informatica, vol. 32, 319-326 (2008). ] 


% Simple bounds/limits 

disp(’Solve the simple spring design problem ...’); 
Lb=[0.05 0.25 2.0]; 

Ub=[2.0 1.3 15.0]; 






%, Initial random guess 


(@=(Lb+Ub) /2; 


§ 
<u, fval ,NumEval]=ffa_mincon(@cost ,@constraint ,u0,Lb,Ub,para) ; 
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% Display results 

bestsolution=u 

bestojb=fval 
total_number_of_function_evaluations=NumEval 


hhh Put your own cost/objective function here -------- hhh 
hh Cost or Objective function 

function z=cost (x) 

z=(2+x(3)) *x(1)72*x (2) ; 


% Constrained optimization using penalty methods 
4 by changing f to F=f+ \sum lam_j*g~2_j*H_j(g_j) 
% where H(g)=0 if g<=0 (true), =1 if g is false 


hhh Put your own constraints here -------------------- hhh 
function [g,geq]=constraint (x) 

% All nonlinear inequality constraints should be here 

% If no inequality constraint at all, simple use g=[]; 
g(1)=1-x (2) *3*x (3) /(7178*x(1) 74) ; 

tmpf=(4*x (2) *2-x (1) *x (2) ) / (12566 (x (2) *x (1) “3-x(1)74)); 
g(2)=tmpf+1/(5108+x(1)72)-1; 
g(3)=1-140.45*x (1) /(x(2)“2*x(3)) ; 

g(4)=x(1)+x(2)-1.5; 


% all nonlinear equality constraints should be here 
% If no equality constraint at all, put geq=[] as follows 


geq=[]; 

hhh End of the part to be modified ------------------- yyy 
hhh oo Ah 
hhh Do not modify the following codes unless you want Ahh 
hhh to improve its performance etc 7%) 
Ba 


4% ===Start of the Firefly Algorithm Implementation ====== 
% Inputs: fhandle => @cost (your own cost function, 


h can be an external file ) 

% nonhandle => @constraint, all nonlinear constraints 
h can be an external file or a function 
hh Lb = lower bounds/limits 


h Ub = upper bounds/limits 

4% para == optional (to control the Firefly algorithm) 

% Outputs: nbest = the best solution found so far 
fbest = the best objective value 

NumEval = number of evaluations: n*MaxGeneration 
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% Start FA 

function [nbest,fbest,NumEval]... 
=ffa_mincon(fhandle,nonhandle,u0, Lb, Ub, para) 

% Check input parameters (otherwise set as default values) 

if nargin<6, para=[20 50 0.25 0.20 1]; end 

if nargin<5, Ub=[]; end 

if nargin<4, Lb=[]; end 

if nargin<3, 

disp(’Usuage: FA_mincon(@cost, @constraint ,u0,Lb,Ub,para)’); 

end 


% n=number of fireflies 
% MaxGeneration=number of pseudo time steps 


4, i te ee a eS ee a ee ee ee eee 


% alpha=0.25; % Randomness 0--1 (highly random) 
% betamn=0.20; % minimum value of beta 
% gamma=1 ; % Absorption coefficient 


if poeGcdewehesee eee ede ee eee eo sanks 
n=para(1); MaxGeneration=para(2) ; 
alpha=para(3); betamin=para(4); gamma=para(5) ; 


%, Total number of function evaluations 
NumEval=n*MaxGeneration; 


% Check if the upper bound & lower bound are the same size 
if length(Lb) ~=length(Ub), 

disp(’Simple bounds/limits are improper!’); 

return 
end 


% Calcualte dimension 
d=length(u0) ; 


% Initial values of an array 

zn=ones(n,1)*107100; 

y ------------------------------------------------ 
% generating the initial locations of n fireflies 
(ns ,Lightn]=init_ffa(n,d,Lb,Ub,u0) ; 


% Iterations or pseudo time marching 
for k=1:MaxGeneration, khhhh start iterations 
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zn(i)=Fun(fhandle,nonhandle,ns(i,:)); 
Lightn(i)=zn(i) ; 
end 


% Ranking fireflies by their light intensity/objectives 
[Lightn, Index]=sort (zn) ; 

ns_tmp=ns ; 

for i=i:n, 

ns(i,:)=ns_tmp(Index(i),:); 

end 


if, Find the current best 
nso=ns; Lighto=Lightn; 
nbest=ns(1,:); Lightbest=Lightn(1) ; 


% For output only 
fbest=Lightbest ; 


% Move all fireflies to the better locations 
[ns]=ffa_move(n,d,ns,Lightn,nso,Lighto,nbest, 
Lightbest ,alpha, betamin, gamma,Lb, Ub) ; 


end %%U%% end of iterations 


ho ----- All the subfunctions are listed here ------------ 
% The initial locations of n fireflies 
function [ns,Lightn]=init_ffa(n,d,Lb,Ub,u0) 
%, if there are bounds/limits, 
if length(Lb)>0, 
for i=i:n, 
ns(i,:)=Lb+(Ub-Lb) .*rand(1,d); 
end 
else 
% generate solutions around the random guess 
for i=i:n, 
ns(i,:)=u0+randn(1,d); 
end 
end 


y% initial value before function evaluations 
Lightn=ones(n,1)*107100; 
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% Updating fireflies 
for i=1:n, 
% The attractiveness parameter beta=exp(-gamma*r) 
for j=i:n, 
r=sqrt (sum((ns(i,:)-ns(j,:)).72)); 
% Update moves 
if Lightn(i)>Lighto(j), % Brighter and more attractive 
beta0=1; beta=(beta0-betamin) *exp(-gammatr.~2)+betamin; 
tmf=alpha.*(rand(1,d)-0.5).*scale; 
ns(i,:)=ns(i,:).*(1-beta)+nso(j,:).*betattmpf ; 
end 
end % end for j 


end % end for i 


% Check if the updated solutions/locations are within limits 
[ns]=findlimits(n,ns,Lb, Ub) ; 


% This function is optional, as it is not in the original FA 

% The idea to reduce randomness is to increase the convergence, 
% however, if you reduce randomness too quickly, then premature 
% convergence can occur. So use with care. 

function alpha=alpha_new(alpha,NGen) 

% alpha_n=alpha_0(1-delta) “NGen=0.005 

% alpha_0=0.9 

delta=1-(0.005/0.9)*(1/NGen) ; 

alpha=(1-delta) *alpha; 


y%, Make sure the fireflies are within the bounds/limits 
function [ns]=findlimits(n,ns,Lb,Ub) 
for i=1i:n, 
% Apply the lower bound 
ns_tmp=ns(i,:); 
I=ns_tmp<Lb; 
ns_tmp(I)=Lb(1I) ; 


% Apply the upper bounds 
J=ns_tmp>Ub ; 
ns_tmp(J)=Ub(J) ; 

% Update this new move 
ns(i,:)=ns_tmp; 


end 
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z=fhandle(u) ; 


% Apply nonlinear constraints by the penalty method 
% Z=f£+sum_k=1°N lam_k g_k*2 *H(g_k) where lam_k >> 1 
z=zt+getnonlinear (nonhandl1le,u) ; 


function Z=getnonlinear (nonhandle,u) 
Z=0; 

% Penalty constant >> 1 

lam=10715; lameq=10715; 

% Get nonlinear constraints 
[g,geq]=nonhandle(u) ; 


% Apply inequality constraints as a penalty function 
for k=1:length(g), 

Z=Z+ lam*g(k)“2*getH(g(k)) ; 
end 
% Apply equality constraints (when geq=[], length->0) 
for k=1:length(geq) , 

Z=Z+lameq*geq(k) ~2*geteqH (geq(k)) ; 
end 


% Test if inequalities hold 
% H(g) which is something like an index function 
function H=getH(g) 


end 


% Test if equalities hold 
function H=geteqH(g) 
if g==0, 
H=0; 
else 


end 
dh ==== End of Firefly Algorithm implementation ====== 
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Chapter 11 


BAT ALGORITHM 





Bat algorithm was formulated by Xin-She Yang in 2010, which showed some 
promising efficiency for global optimization. Interestingly, it can be viewed to 
link with harmony search and particle swarm optimization under appropriate 
conditions. 


11.1 ECHOLOCATION OF BATS 


11.1.1. Behaviour of microbats 


Bats are fascinating animals. They are the only mammals with wings and they 
also have advanced capability of echolocation. It is estimated that there are about 
1000 different species which account for up to 20% of all mammal species. Their 
size ranges from the tiny bumblebee bat (of about 1.5 to 2 g) to the giant bats with 
wingspan of about 2 m and weight up to about 1 kg. Microbats typically have 
forearm length of about 2.2 to 11 cm. Most bats uses echolocation to a certain 
degree; among all the species, microbats are a famous example as microbats use 
echolocation extensively, while megabats do not. 

Most microbats are insectivores. Microbats use a type of sonar, called echolo- 
cation, to detect prey, avoid obstacles, and locate their roosting crevices in the 
dark. These bats emit a very loud sound pulse and listen for the echo that bounces 
back from the surrounding objects. Their pulses vary in properties and can be 
correlated with their hunting strategies, depending on the species. Most bats 
use short, frequency-modulated signals to sweep through about an octave, while 
others more often use constant-frequency signals for echolocation. Their signal 
bandwidth varies with species, and often increases by using more harmonics. 

Studies show that microbats use the time delay from the emission and detec- 

otetaadion of the echo, the time difference between their two ears, and the loudness 
x tions of the echoes to build up three dimensional scenario of the surround- 
“Dhey can detect the distance and orientation of the target, the type of prey, 
éven the moving speed of the prey such as small insects. Indeed, studies 
ested that bats seem to be able to discriminate targets by the variations of 
Doppler effect induced by the wing-flutter rates of the target insects. 
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11.1.2 Acoustics of Echolocation 


Though each pulse only lasts a few thousandths of a second (up to about 8 to 10 
ms), however, it has a constant frequency which is usually in the region of 25 kHz 
to 150 kHz. The typical range of frequencies for most bat species are in the region 
between 25 kHz and 100 kHz, though some species can emit higher frequencies up 
to 150 kHz. Each ultrasonic burst may last typically 5 to 20 ms, and microbats 
emit about 10 to 20 such sound bursts every second. When hunting for prey, the 
rate of pulse emission can be sped up to about 200 pulses per second when they 
fly near their prey. Such short sound bursts imply the fantastic ability of the 
signal processing power of bats. In fact, studies show the integration time of the 
bat ear is typically about 300 to 400 ps. 

As the speed of sound in air is typically v = 340 m/s at room temperature, 
the wavelength X of the ultrasonic sound bursts with a constant frequency f is 
given by 

mn f? (11.1) 
which is in the range of 2 mm to 14 mm for the typical frequency range from 25 
kHz to 150 kHz. Such wavelengths are in the same order of their prey sizes. 

Amazingly, the emitted pulse could be as loud as 110 dB, and, fortunately, 
they are in the ultrasonic region. The loudness also varies from the loudest when 
searching for prey and to a quieter base when homing towards the prey. The 
travelling range of such short pulses are typically a few metres, depending on 
the actual frequencies. Microbats can manage to avoid obstacles as small as thin 
human hairs. 

Obviously, some bats have good eyesight, and most bats also have very sen- 
sitive smell sense. In reality, they will use all the senses as a combination to 
maximize the efficient detection of prey and smooth navigation. However, here 
we are only interested in the echolocation and the associated behaviour. 

Such echolocation behaviour of microbats can be formulated in such a way that 
it can be associated with the objective function to be optimized, and this makes it 
possible to formulate new optimization algorithms. We will first outline the basic 
formulation of the Bat Algorithm (BA) and then discuss its implementation. 


11.2) BAT ALGORITHM 


If we idealize some of the echolocation characteristics of microbats, we can develop 
various bat-inspired algorithms or bat algorithms. For simplicity, we now use the 
following approximate or idealized rules: 


1. All bats use echolocation to sense distance, and they also ‘know’ the dif- 
ference between food/prey and background barriers; 


y 


Bats fly randomly with velocity v; at position x; with a fixed frequency 
fmin (or wavelength X), varying wavelength X (or frequency f) and loudness 
Ao to search for prey. They can automatically adjust the wavelength (or 
frequency) of their emitted pulses and adjust the rate of pulse emission 
r € [0,1], depending on the proximity of their target; 








11.2 BAT ALGORITHM 99 


3. Although the loudness can vary in many ways, we assume that the loudness 
varies from a large (positive) Ao to a minimum value Amin. 


Another obvious simplification is that no ray tracing is used in estimating 
the time delay and three dimensional topography. Though this might be a good 
feature for the application in computational geometry, however, we will not use 
this, as it is more computationally extensive in multidimensional cases. 

In addition to these simplified assumptions, we also use the following ap- 
proximations, for simplicity. In general the frequency f in a range [fmin, fmax| 
corresponds to a range of wavelengths [Amin, Amax]. For example, a frequency 
range of [20 kHz, 500 kHz] corresponds to a range of wavelengths from 0.7 mm 
to 17 mm. 

For a given problem, we can also use any wavelength for the ease of implemen- 
tation. In the actual implementation, we can adjust the range by adjusting the 
frequencies (or wavelengths). The detectable range (or the largest wavelength) 
should be chosen such that it is comparable to the size of the domain of interest, 
and then toning down to smaller ranges. Furthermore, we do not necessarily have 
to use the wavelengths themselves at all, instead, we can also vary the frequency 
while fixing the wavelength A’. This is because and f are related, as Af is 
constant. We will use this later approach in our implementation. 

For simplicity, we can assume f € [0, fmax]. We know that higher frequencies 
have short wavelengths and travel a shorter distance. For bats, the typical ranges 
are a few metres. The rate of pulse can simply be in the range of [0,1] where 0 
means no pulses at all, and 1 means the maximum rate of pulse emission. 

Based on the above approximations and idealization, the basic steps of the 
Bat Algorithm (BA) can be summarized as the pseudo code shown in Fig. 11.1. 


11.2.1 Movement of Virtual Bats 


In simulations, we have to use virtual bats. We have to define the rules how their 
positions x; and velocities v; in a d-dimensional search space are updated. The 
new solutions x and velocities uv‘ at time step t are given by 


da = fmin + (fmax = fmin), (11.2) 
vp = vy! + (ei — ae) fi (11.3) 
vi=at+oul, (11.4) 


where @ € [0,1] is a random vector drawn from a uniform distribution. Here x. 
is the current global best location (solution) which is located after comparing all 
the solutions among all the n bats. As the product A; fi is the velocity increment, 
we can use f; (or A; ) to adjust the velocity change while fixing the other factor ; 
(or fi), depending on the type of the problem of interest. In our implementation, 
we will use fmin = 0 and fmax = O(1), depending on the domain size of the 
oblem of interest. Initially, each bat is randomly assigned a frequency which is 
awn uniformly from [fmin, fmax]- 


Znew = Lold + € A’, (11.5) 
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Bat Algorithm 





Initialize the bat population x; (i = 1,2,...,n) and v; 
Initialize frequencies f;, pulse rates r; and the loudness A; 
while (¢ <Mazx number of iterations) 
Generate new solutions by adjusting frequency, 
and updating velocities and locations/solutions [(11.2) to (11.4)] 
if (rand > r;) 
Select a solution among the best solutions 
Generate a local solution around the selected best solution 
end if 
Generate a new solution by flying randomly 
if (rand < A; & f(xi) < f(x.)) 
Accept the new solutions 
Increase r; and reduce A; 
end if 
Rank the bats and find the current best x. 
end while 





Figure 11.1: Pseudo code of the bat algorithm (BA). 


where € € [—1, 1] is a random number, while A’ =<A{> is the average loudness 
of all the bats at this time step. 

The update of the velocities and positions of bats have some similarity to the 
procedure in the standard particle swarm optimization, as f; essentially controls 
the pace and range of the movement of the swarming particles. To a degree, 
BA can be considered as a balanced combination of the standard particle swarm 
optimization and the intensive local search controlled by the loudness and pulse 
rate. 


11.2.2 Loudness and Pulse Emission 


Furthermore, the loudness A; and the rate r; of pulse emission have to be updated 
accordingly as the iterations proceed. As the loudness usually decreases once a 
bat has found its prey, while the rate of pulse emission increases, the loudness 
can be chosen as any value of convenience. For example, we can use Ag = 100 
and Amin = 1. For simplicity, we can also use Ag = 1 and Amin = 0, assuming 
Amin = 0 means that a bat has just found the prey and temporarily stop emitting 
yn 
“ey sound. Now we have 







A =aAi, ritt= re[l — exp(—7t)], (11.6) 


a 


ere a and y are constants. In fact, a is similar to the cooling factor of a 
v Yooling schedule in the simulated snviealng discussed earlier in this book. For 
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Figure 11.2: The eggcrate function (left) and the locations of 40 bats in the last 
ten iterations (right). 


any 0<a<1andy> 0, we have 
Ai=>0, rj sre, as t> oO. (11.7) 


In the simplest case, we can use a = ¥, and we have used a = y = 0.9 in our 
simulations. 

The choice of parameters requires some experimenting. Initially, each bat 
should have different values of loudness and pulse emission rate, and this can be 
achieved by randomization. For example, the initial loudness A? can typically be 
around [1,2], while the initial emission rate r? can be around zero, or any value 
r? € [0,1] if using (11.6). Their loudness and emission rates will be updated 
only if the new solutions are improved, which means that these bats are moving 
towards the optimal solution. 


11.3. VALIDATION AND DISCUSSIONS 


From the pseudo code, it is relatively straightforward to implement the Bat Al- 
gorithm in any programming language. For the ease of visualization, we have 
implemented it using Matlab for various test functions. 

There are many standard test functions for validating new algorithms. As a 
simple benchmark, let us look at the eggcrate function 


faa? +y? +25(sin? 2+ sin? y), (2,y) € [—2n, 27] x [—27, 27]. 


We know that f has a global minimum fmin = 0 at (0,0). In our implementation, 
we use n = 25 to 50 virtual bats, and a = 0.9. For the multimodal eggcrate 
(oletaedunction, a snapshot of the last 10 iterations is shown in Fig. 11.2 where all bats 
iy 16ve towards the global best: (0,0). 
y be bat algorithm is much superior to other algorithms in terms of accu- 
and efficiency (Yang and Deb, 2010). If we replace the variations of the 
rency fi by a random parameter and setting A; = 0 and r; = 1, the bat 
orithm essentially becomes the standard particle swarm optimization (PSO). 
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Similarly, if we do not use the velocities, we use fixed loudness and rate: A; and 
r;. For example, A; = r; = 0.7, this algorithm is virtually reduced to a simple 
harmony search (HS), as the frequency /wavelength change is essentially the pitch 
adjustment, while the rate of pulse emission is similar to the harmonic acceptance 
rate (here with a twist) in the harmony search algorithm. The current studies 
imply that the proposed new algorithm is potentially more powerful and thus 
should be investigated further in many applications of engineering and industrial 
optimization problems. 


11.4 IMPLEMENTATION 


Yj wanna anna nnn nnn nnn nnn n nnn nnn nnn n an - ‘ 
% Bat-inspired algorithm for continuous optimization h 
% Programmed by Xin-She Yang @Cambridge University h 
hy, SSS SRS S SRS SSS saa hhs SSS Sena a SSS eae sae n ae hh 


function [best,fmin,N_iter]=bat (para) 
% Default parameters 
if nargin<1, para=[10 0.25 0.5]; end 


n=para(1); % Population size, typically 10 to 25 
A=para(2) ; % Loudness (constant or decreasing) 
r=para(3) ; % Pulse rate (constant or decreasing) 
% This frequency range determines the scalings 

Qmin=0 ; % Frequency minimum 

Qmax=2 ; % Frequency maximum 

% Iteration parameters 

tol=107(-5); % Stop tolerance 

N_iter=0; %, Total number of function evaluations 
% Dimension of the search variables 

d=3; 

% Initial arrays 

Q=zeros(n,1); % Frequency 

v=zeros(n,d); %, Velocities 


% Initialize the population/solutions 
for i=1:n, 
Sol(i,:)=randn(1,d); 
Fitness(i)=Fun(Sol(i,:)); 
end 
%, Find the current best 
[fmin,I]=min(Fitness) ; 
best=Sol(I,:); 
% Start the iterations -- Bat Algorithm 
gh Metahcn while (fmin>tol) 


75 
Ss, 
& 


é % Loop over all bats/solutions 

for i=i:n, 
Q(i) =Qmin+ (Qmin-Qmax) *rand; 
v(i,:)=v(i,:)+(Sol(i, :)-best) *Q (i); 
S(i,:)=S0l(i,:)+v(i,:); 
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% Pulse rate 

if rand>r 
S(i,:)=best+0.01*randn(1,d); 

end 


% Evaluate new solutions 
Fnew=Fun(S(i,:)); 
% If the solution improves or not too loudness 
if (Fnew<=Fitness(i)) & (rand<A) , 
Sol(i,:)=S(i,:); 
Fitness (i)=Fnew; 
end 


% Update the current best 
if Fnew<=fmin, 
best=S(i,:); 
fmin=Fnew; 
end 
end 
N_iter=N_iter+n; 


end 

% Output/display 

disp([’Number of evaluations: ’ ,num2str(N_iter)]); 
disp([’Best =’ ,num2str(best),’ fmin=’ ,num2str(fmin)]); 
% Objective function -- Rosenbrock’s 3D function 


function z=Fun(u) 
z=(1-u(1))72+100* (u(2) -u(1) 72) “2+ (1-u (3) ) 72; 


11.5 FURTHER TOPICS 


From the formulation of the bat algorithm and its implementation and compar- 
ison, we can see that it is a very promising algorithm. It is potentially more 
powerful than particle swarm optimization and genetic algorithms as well as har- 
mony search. The primary reason is that BA uses a good combination of major 
advantages of these algorithms in some way. Moreover, PSO and harmony search 
are the special cases of the bat algorithm under appropriate simplifications. 

In addition, the fine adjustment of the parameters a and y can affect the con- 
vergence rate of the bat algorithm. In fact, parameter a acts in a similar role as 
the cooling schedule in the simulated annealing. Though the implementation is 
slightly more complicated than those for many other metaheuristic algorithms; 
however, it does show that it utilizes a balanced combination of the advantages 

aoterangt existing successful algorithms with innovative feature based on the echoloca- 
ay oh, behaviour of bats. New solutions are generated by adjusting frequencies, 
aess and pulse emission rates, while the proposed solution is accepted or not 
Bids on the quality of the solutions controlled or characterized by loudness 
spulse rate which are in turn related to the closeness or the fitness of the 
Aitions /solution to the global optimal solution. 
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The exciting results suggest that more studies will highly be needed to carry 
out the sensitivity analysis, to analyze the rate of algorithm convergence, and to 
improve the convergence rate even further. More extensive comparison studies 
with a more wide range of existing algorithms using much tough test functions 
in higher dimensions will pose more challenges to the algorithms, and thus such 
comparisons will potentially reveal the virtues and weakness of all the algorithms 
of interest. 

An interesting extension will be to use different schemes of wavelength or fre- 
quency variations instead of the current linear implementation. In addition, the 
rates of pulse emission and loudness can also be varied in a more sophisticated 
manner. Another extension for discrete problems is to use the time delay be- 
tween pulse emission and the echo bounced back. For example, in the travelling 
salesman problem, the distance between two adjacent nodes/cities can easily be 
coded as the time delay. As microbats use time difference between their two 
ears to obtain three-dimensional information, they can identify the type of prey 
and the velocity of a flying insect. Therefore, a further natural extension to the 
current bat algorithm would be to use the directional echolocation and Doppler 
effect, which may lead to even more interesting variants and new algorithms. 
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Chapter 12 


CUCKOO SEARCH 





Cuckoo search (CS) is one of the latest nature-inspired metaheuristic algorithms, 
developed in 2009 by Xin-She Yang of Cambridge University and Suash Deb of 
C. V. Raman College of Engineering. CS is based on the brood parasitism of 
some cuckoo species. In addition, this algorithm is enhanced by the so-called 
Lévy flights, rather than by simple isotropic random walks. Recent studies show 
that CS is potentially far more efficient than PSO and genetic algorithms. 


12.1 CUCKOO BREEDING BEHAVIOUR 


Cuckoo are fascinating birds, not only because of the beautiful sounds they can 
make, but also because of their aggressive reproduction strategy. Some species 
such as the ani and Guira cuckoos lay their eggs in communal nests, though they 
may remove others’ eggs to increase the hatching probability of their own eggs. 
Quite a number of species engage the obligate brood parasitism by laying their 
eggs in the nests of other host birds (often other species). 

There are three basic types of brood parasitism: intraspecific brood parasitism, 
cooperative breeding, and nest takeover. Some host birds can engage direct 
conflict with the intruding cuckoos. If a host bird discovers the eggs are not their 
owns, they will either get rid of these alien eggs or simply abandon its nest and 
build a new nest elsewhere. Some cuckoo species such as the New World brood- 
parasitic Tapera have evolved in such a way that female parasitic cuckoos are 
often very specialized in the mimicry in colour and pattern of the eggs of a few 
chosen host species. This reduces the probability of their eggs being abandoned 
and thus increases their reproductivity. 

In addition, the timing of egg-laying of some species is also amazing. Parasitic 

4 oletangyickoos often choose a nest where the host bird just laid its own eggs. In general, 
x € cuckoo eggs hatch slightly earlier than their host eggs. Once the first cuckoo 
‘ eis hatched, the first instinct action it will take is to evict the host eggs by 
propelling the eggs out of the nest, which increases the cuckoo chick’s 
© of food provided by its host bird. Studies also show that a cuckoo chick 
a also mimic the call of host chicks to gain access to more feeding opportunity. 
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12.2 LEVY FLIGHTS 


On the other hand, various studies have shown that flight behaviour of many an- 
imals and insects has demonstrated the typical characteristics of Lévy flights. A 
recent study by Reynolds and Frye shows that fruit flies or Drosophila melanogaster, 
explore their landscape using a series of straight flight paths punctuated by a sud- 
den 90° turn, leading to a Lévy-flight-style intermittent scale free search pattern. 
Studies on human behaviour such as the Ju/’hoansi hunter-gatherer foraging pat- 
terns also show the typical feature of Lévy flights. Even light can be related to 
Lévy flights. Subsequently, such behaviour has been applied to optimization and 
optimal search, and preliminary results show its promising capability. 


12.3. CUCKOO SEARCH 


For simplicity in describing our new Cuckoo Search, we now use the following 
three idealized rules: 


e Each cuckoo lays one egg at a time, and dumps its egg in a randomly 
chosen nest; 


e The best nests with high-quality eggs will be carried over to the next 
generations; 


e The number of available host nests is fixed, and the egg laid by a cuckoo 
is discovered by the host bird with a probability pa € [0,1]. In this case, 
the host bird can either get rid of the egg, or simply abandon the nest and 
build a completely new nest. 


As a further approximation, this last assumption can be approximated by a 
fraction pa of the n host nests are replaced by new nests (with new random 
solutions). 

For a maximization problem, the quality or fitness of a solution can simply be 
proportional to the value of the objective function. Other forms of fitness can be 
defined in a similar way to the fitness function in genetic algorithms. 

For the implementation point of view, we can use the following simple rep- 
resentations that each egg in a nest represents a solution, and each cuckoo can 
lay only one egg (thus representing one solution), the aim is to use the new and 
potentially better solutions (cuckoos) to replace a not-so-good solution in the 
nests. Obviously, this algorithm can be extended to the more complicated case 
where each nest has multiple eggs representing a set of solutions. For this present 
work, we will use the simplest approach where each nest has only a single egg. 
In this case, there is no distinction between egg, nest or cuckoo, as each nest 
corresponds to one egg which also represents one cuckoo. 

Based on these three rules, the basic steps of the Cuckoo Search (CS) can be 
summarized as the pseudo code shown in Fig. 12.1. 

When generating new solutions «+? for, say, a cuckoo i, a Lévy flight is 






at) 2 4 a@ Lévy(d), (12.1) 


i 
Where a > 0 is the step size which should be related to the scales of the problem 
Wof interests. In most cases, we can use a = O(L/10) where L is the characteristic 
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Cuckoo Search via Lévy Flights 





Objective function f(x), x = (a1,...,%a)" 
Generate initial population of n host nests x; 
while (t <MaxGeneration) or (stop criterion) 
Get a cuckoo randomly/generate a solution by Lévy flights 
and then evaluate its quality/fitness F; 
Choose a nest among n (say, j) randomly 
if (F; = F;), 
Replace 7 by the new solution 
end 
A fraction (pa) of worse nests are abandoned 
and new ones/solutions are built/generated 
Keep best solutions (or nests with quality solutions) 
Rank the solutions and find the current best 
end while 
Postprocess results and visualization 





Figure 12.1: Pseudo code of the Cuckoo Search (CS). 


scale of the problem of interest. The above equation is essentially the stochastic 
equation for a random walk. In general, a random walk is a Markov chain whose 
next status/location only depends on the current location (the first term in the 
above equation) and the transition probability (the second term). The product @ 
means entrywise multiplications. This entrywise product is similar to those used 
in PSO, but here the random walk via Lévy flight is more efficient in exploring 
the search space, as its step length is much longer in the long run. 

The Lévy flight essentially provides a random walk whose random step length 
is drawn from a Lévy distribution 


Lévy ~u=t, (laze): (12.2) 


which has an infinite variance with an infinite mean. Here the steps essentially 
form a random walk process with a power-law step-length distribution with a 
heavy tail. Some of the new solutions should be generated by Lévy walk around 
the best solution obtained so far, this will speed up the local search. However, a 
substantial fraction of the new solutions should be generated by far field random- 
ization and whose locations should be far enough from the current best solution, 

this will make sure that the system will not be trapped in a local optimum. 
From a quick look, it seems that there is some similarity between CS and 
i Mean: climbing in combination with some large scale randomization. But there are 
x significant differences. Firstly, CS is a population-based algorithm, in a 
‘similar to GA and PSO, but it uses some sort of elitism and/or selection 
ar to that used in harmony search. Secondly, the randomization in CS is 
» efficient as the step length is heavy-tailed, and any large step is possible. 
irdly, the number of parameters in CS to be tuned is fewer than GA and PSO, 
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and thus it is potentially more generic to adapt to a wider class of optimization 
problems. In addition, each nest can represent a set of solutions, CS can thus be 
extended to the type of meta-population algorithms. 


12.4 CHOICE OF PARAMETERS 


After implementation, we have to validate the algorithm using test functions with 
analytical or known solutions. For example, one of the many test functions we 
have used is the bivariate Michalewicz function 


2 


*) — sin(y) sin?” (29), (12.3) 


2m 
( 7 


f(x,y) = —sin(x) sin 
where m = 10 and (az,y) € [0,5] x [0,5]. This function has a global minimum 
fx © —1.8013 at (2.20319, 1.57049). This global optimum can easily be found 
using Cuckoo Search, and the results are shown in Fig. 12.2 where the final 
locations of the nests are also marked with © in the figure. Here we have used 
n = 15 nests, a = 1 and pa = 0.25. In most of our simulations, we have used 
n= 15 to 50. 

From the figure, we can see that, as the optimum is approaching, most nests 
aggregate towards the global optimum. We also notice that the nests are also 
distributed at different (local) optima in the case of multimodal functions. This 
means that CS can find all the optima simultaneously if the number of nests 
are much higher than the number of local optima. This advantage may become 
more significant when dealing with multimodal and multiobjective optimization 
problems. 

We have also tried to vary the number of host nests (or the population size 
mn) and the probability pa. We have used n = 5,10,15, 20, 30, 40,50, 100, 
150, 250,500 and pa = 0, 0.01, 0.05,0.1, 0.15, 0.2, 0.25, 0.3,0.4,0.5. From our 
simulations, we found that n = 15 to 40 and pa = 0.25 are sufficient for most 
optimization problems. Results and analysis also imply that the convergence 
rate, to some extent, is not sensitive to the parameters used. This means that 
the fine adjustment is not needed for any given problems. 


12.5 IMPLEMENTATION 


Of Se 
% Cuckoo algorithm by Xin-She Yang and Suasg Deb h 
% Programmed by Xin-She Yang at Cambridge University h 
pc a 
function [bestsol,fval]=cuckoo_search (Ngen) 

_% Here Ngen is the max number of function evaluations 

45, 


%, 


4f nargin<1, Ngen=1500; end 
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Figure 12.2: Search paths of nests using Cuckoo Search. The final locations of 

the nests are marked with © in the figure. 


n=25; 


% Discovery rate of alien eggs/solutions 
pa=0.25; 


% Random initial solutions 

nest=randn(n,d) ; 

fbest=ones(n,1)*107 (100) ; % minimization problems 
Kbest=1; 


for j=1:Ngen, 
j%, Find the current best 
Kbest=get_best_nest (fbest) ; 
% Choose a random nest (avoid the current best) 
k=choose_a_nest(n,Kbest) ; 
bestnest=nest (Kbest, :) 
% Generate a new solution (but keep the current best) 
s=get_a_cuckoo(nest(k,:) ,bestnest) ; 


/, Evaluate this solution 
fnew=fobj(s) ; 
colette, if fnew<=fbest(k), 
“a Se fbest (k)=fnew; 
nest (k,:)=s; 





ou, discovery and randomization 
Sif rand<pa, 
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k=get_max_nest (fbest) ; 
s=emptyit (nest (k,:)); 
nest(k,:)=s; 
fbest (k)=fobj(s) ; 
end 
end 


4) Post-optimization processing 
“4% Find the best and display 
[fval,I]=min(fbest) 
bestsol=nest(I,:) 


hh Display all the nests 
nest 


hh 777777777 All subfunctions are listed below ----------- 
hh Choose a nest randomly 

function k=choose_a_nest(n,Kbest) 

k=floor (rand*n) +1; 

% Avoid the best 


if k==Kbest, 
k=mod(k+1,n)+1; 
end 


hh Get a cuckoo and generate new solutions by ramdom walk 
function s=get_a_cuckoo(s,star) 

% This is a random walk, which is less efficient 

% than Levy flights. In addition, the step size 

% should be a vector for problems with different scales. 
% Here is the simplified implementation for demo only! 
stepsize=0.05; 

s=startstepsize*randn(size(s)); 


i, Find the worse nest 
function k=get_max_nest (fbest) 
[fmax ,k]=max(fbest) ; 


i“, Find the current best nest 
function k=get_best_nest (fbest) 
[fmin,k]=min(fbest) ; 


hh Replace some (of the worst nests) 
hh by constructing new solutions/nests 






eAgain the step size should be varied 
Here is a simplified approach 

> 

#s+0.05*randn(size(s)) ; 
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% d-dimensional objective function 

function z=fobj(u) 

% Rosenbrock’s function (in 2D) 

% It has an optimal solution at (1.000,1.000) 
z=(1-u(1))72+100* (u(2)-u(1) *2)°2; 


If we run this program using some standard test functions, we can observe that 
CS outperforms many existing algorithms such as GA and PSO. The primary 
reasons are: 1) a fine balance of randomization and intensification, and 2) fewer 
number of control parameters. As for any metaheuristic algorithms, a good 
balance of intensive local search and an efficient exploration of the whole search 
space will usually lead to a more efficient algorithm. On the other hand, there 
are only two parameters in this algorithm, the population size n, and pa. Once 
n is fixed, pa essentially controls the elitism and the balance of randomization 
and local search. Few parameters make an algorithm less complex and thus 
potentially more generic. Such observations deserve more systematic research 
and further elaboration in the future work. 

It is worth pointing out that there are three ways to carry out randomization: 
uniform randomization, random walks and heavy-tailed walks. The simplest way 
is to use a uniform distribution so that new solutions are limited between upper 
and lower bounds. Random walks can be used for global randomization or local 
randomization, depending on the step size used in the implementation. Lévy 
flights are heavy-tailed, which is most suitable for the randomization on the 
global scale. 

As an example for solving constrained optimization, we now solved the spring 
design problem discussed in the chapter on firefly algorithm. The Matlab code 
is given below 


% Cuckoo Search for nonlinear constrained optimization 
% Programmed by Xin-She Yang @ Cambridge University 2009 
function [bestsol,fval]=cuckoo_spring(N_iter) 

format long; 

% number of iterations 

if nargin<1, N_iter=15000; end 

% Number of nests 

n=25; 

disp(’Searching ... may take a minute or so ...’); 

% a variables and simple bounds 

% Lower and upper bounds 

Lb=[0.05 0.25 2.0]; 

Ub=[2.0 1.3 15.0]; 

% Number of variables 

d=length (Lb) ; 
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Kbest=1; 


% Start of the cuckoo search 

for j=1:N_iter, 
% Find the best nest 
[fmin,Kbest]=get_best_nest (fbest) ; 
% Choose a nest randomly 
k=choose_a_nest(n,Kbest) ; 
bestnest=nest (Kbest,:) ; 
% Get a cuckoo with a new solution 
s=get_a_cuckoo(nest(k,:) ,bestnest ,Lb,Ub) ; 


% Update if the solution improves 
fnew=fobj(s) ; 
if fnew<=fbest(k), 
fbest (k)=fnew; 
nest (k,:)=s; 
end 


% Discovery and randomization 
if rand<pa, 
k=get_max_nest (fbest) ; 
s=emptyit (nest (k,:),Lb,Ub); 
nest (k,:)=s; 
fbest (k)=fobj(s) ; 
end 

end 


i, Find the best 
[fmin, I]=min(fbest) 
bestsol=nest(I,:); 


% Show all the nests 

nest 

% Display the best solution 
bestsol, fmin 


% Initial locations of all n cuckoos 

function [guess]=init_cuckoo(n,d,Lb,Ub) 

for i=1:n, 
guess(i,1:d)=Lbtrand(1,d) .*(Ub-Lb) ; 


end 







=f loor (rand*n) +1; 
=p 
SAvoid the best 
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k=mod(k+1,n)+1; 
end 


uf, Get a cuckoo with a new solution via a random walk 
hh Note: Levy flights were not implemented in this demo 
function s=get_a_cuckoo(s,star,Lb,Ub) 
s=start+0.01*(Ub-Lb) .*randn(size(s)) ; 

s=bounds(s,Lb,Ub) ; 


%/ Find the worse nest 
function k=get_max_nest (fbest) 
[fmax,k]=max(fbest) ; 


hi Find the best nest 
function [fmin,k]=get_best_nest (fbest) 
[fmin,k]=min(fbest) ; 


hh Replace an abandoned nest by constructing a new nest 
function s=emptyit(s,Lb,Ub) 

s=s+0.01*(Ub-Lb) .*randn(size(s)); 

s=bounds(s,Lb,Ub) ; 


% Check if bounds are met 
function ns=bounds (ns ,Lb,Ub) 
% Apply the lower bound 
ns_tmp=ns ; 
I=ns_tmp<Lb; 
ns_tmp(I)=Lb(I) ; 
% Apply the upper bounds 
J=ns_tmp>Ub; 
ns_tmp(J)=Ub(J) ; 
% Update this new move 
ns=ns_tmp; 


% d-dimensional objective function 
function z=fobj(u) 

% The well-known spring design problem 
z=(2+u(3)) *u(1) “2*u(2) ; 
z=zt+getnonlinear (u) ; 


function Z=getnonlinear (u) 
Z=0; 
% Penalty constant 







equality constraints 
21-u(2) ~3*u(3)/(71785*u(1)~4) ; 
2 B= (4*u (2) 72-u(1) *u(2) ) / (12566* (u(2) #u(1) 73-u(1)74)); 
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g(2)=gtmp+1/(5108*u(1)72)-1; 
g(3)=1-140.45*u(1) /(u(2) *2*u(3)) ; 
g(4)=(u(1)+u(2))/1.5-1; 


% No equality constraint in this problem, so empty; 


geq=[]; 


% Apply inequality constraints 
for k=1:length(g), 

Z=Z+ lam*g(k)“2*getH(g(k)) ; 
end 
% Apply equality constraints 
for k=1:length(geq) , 

Z=Z+lam*geq(k) “2*getHeq(geq(k)) ; 

end 


% Test if inequalities hold 
% Index function H(g) for inequalities 
function H=getH(g) 
if g<=0, 
H=0; 
else 


end 
% Index function for equalities 
function H=getHeq(geq) 


This potentially powerful optimization algorithm can easily be extended to 
study multiobjective optimization applications with various constraints, even to 
NP-hard problems. Further studies can focus on the sensitivity and parameter 
studies and their possible relationships with the convergence rate of the algo- 
rithm. Hybridization with other popular algorithms such as PSO and differential 
evolution will also be potentially fruitful. 
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Chapter 13 


ANNS AND SUPPORT VECTOR MACHINE 





As this book is an introduction to metaheuristic algorithms, our focus has always 
been on each algorithm. However, some algorithms cannot be classified as meta- 
heuristic, but they are closely related to optimization. For this reason, we now 
briefly introduce artificial neural networks and support vector machine. As we 
will see, they are in essence optimization algorithms, working in different context. 


13.1. ARTIFICIAL NEURAL NETWORKS 


13.1.1 Artificial Neuron 


The basic mathematical model of an artificial neuron was first proposed by W. 
McCulloch and W. Pitts in 1943, and this fundamental model is referred to as 
the McCulloch-Pitts model. Other models and neural networks are based on it. 
An artificial neuron with n inputs or impulses and an output yx will be ac- 
tivated if the signal strength reaches a certain threshold #. Each input has a 
corresponding weight w; (see Fig. 13.1). The output of this neuron is given by 


n= (5 wis), (13.1) 


where the weighted sum € = es w;u; is the total signal strength, and ® is the 
so-called activation function, which can be taken as a step function. That is, we 


w= {5 a (13.2) 


have 


0 iff <8. 


e step function has discontinuity, sometimes, it is easier to use a nonlinear, 
sh function, known as the Sigmoid function 


S@ =o 
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(13.3) 
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Ui Wi Yk 


Figure 13.1: A simple neuron. 


which approaches 1 as U — oo, and becomes 0 as U — —oco. An interesting 
property of this function is 


S'(é) = S(é)[1 — S(E)]. (13.4) 


13.1.2 Neural Networks 


A single neuron can only perform a simple task — on or off. Complex functions 
can be designed and performed using a network of interconnecting neurons or 
perceptrons. The structure of a network can be complicated, and one of the 
most widely used is to arrange them in a layered structure, with an input layer, 
an output layer, and one or more hidden layer (see Fig. 13.2). The connection 
strength between two neurons is represented by its corresponding weight. Some 
artificial neural networks (ANNs) can perform complex tasks, and can simulate 
complex mathematical models, even if there is no explicit functional form math- 
ematically. Neural networks have developed over last few decades and have been 
applied in almost all areas of science and engineering. 

The construction of a neural network involves the estimation of the suitable 
weights of a network system with some training/known data sets. The task of the 
training is to find the suitable weights w;; so that the neural networks not only 
can best-fit the known data, but also can predict outputs for new inputs. A good 
artificial neural network should be able to minimize both errors simultaneously 
— the fitting/learning errors and the prediction errors. 

The errors can be defined as the difference between the calculated (or pred- 
icated) output o, and real output y, for all output neurons in the least-square 
sense 


1 
B=; d (0 =)". (13.5) 
=1 


Here the output og is a function of inputs/activations and weights. In order to 
minimize this error, we can use the standard minimization techniques to find the 


dE OE Aox 13.6) 


A = 
Sle "Deeg tl Se Ownk , 
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input layer (7) hidden layer (h) output neurons (k) 


Figure 13.2: Schematic representation of a three-layer neural 
network with n; inputs, m hidden nodes and n, outputs. 


where 77 is the learning rate. Typically, we can choose 7 = 1. 











From .. 
Sh = So wakon, (k =1,2,..., no), (13:7 
h=1 
and i 
on = f(Sk) = 7m (13.8 
we have 
f=f-f), (13.9 
Oox Oo; OSk 
= = 1 : 13.10 
Ownk OSk Ownk on ( Ok)On, ( 
and 3E 
aa = (0K — Yk). (13.11 
Therefore, we have 
Awnk => —oKOn, Ok = on(1 = Ox) (OK = Yk) (13.12 


13.1.3. Back Propagation Algorithm 


There are many ways of calculating weights by supervised learning. One of 
the simplest and widely used methods is to use the back propagation algo- 
rithm for training neural networks, often called back propagation neural networks 
4BPNNs). 


as, 


yMetan 
“a “Lhe basic idea is to start from the output layer and propagate backwards so 
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BPNN 


Initialize weight matrices Win, and Wp randomly 
for all training data points 
while ( residual errors are not zero ) 
Calculate the output for the hidden layer on using (18.13) 
Calculate the output for the output layer o, using (18.14) 
Compute errors dp and bp, using (13.15) and (13.16) 
Update weights win and wpe via (18.17) and (13.18) 
end while 
end for 








Figure 13.3: Pseudo code of back propagation neural networks. 








1 
On = =a é h=1,2,...,m), 13.13 
seal + exp[— pany WihUil ( ) ( ) 
and the outputs for the output nodes 

ee : (a ee (13.14) 

k 1+ exp|[— ae whkon]’ ek Re ee . 

The errors for the output nodes are given by 

Ok = or(1 — on) (yr — 0k), (k= 1D) ie, ie); (13.15) 


where yz(k = 1,2,...,%0) are the data (real outputs) for the inputs ui(i = 
1,2,...,n:). Similarly, the errors for the hidden nodes can be written as 


bn =on(1—on) S>wnrde, (b= 1,2, 5m). (13.16) 


k=1 


The updating formulae for weights at iteration t are 





wy! = whe + ndk0n, (13.17) 
and 
wie = Win + Non, (13.18) 


where 0 < 7 < 1 is the learning rate. 
Here we can see that the weight increments are 


ham Awih = onus, (13.19) 






fth similar updating formulae for wrx. An improved version is to use the so- 
ed weight momentum a to increase the learning efficiency 


Awin = Noni + awin(t — 1), (13.20) 
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where 7 is an extra parameter. 

There are many good software packages for artificial neural networks, and 
there are dozens of good books fully dedicated to implementation. Therefore, we 
will not provide any code here. 


13.2} SUPPORT VECTOR MACHINE 


Support vector machine is a powerful tool which becomes increasingly popular 
in classifications, data mining, pattern recognition, artificial intelligence, and 
optimization. 


13.2.1 Classifications 


In many applications, the aim is to separate some complex data into different 
categories. For example, in pattern recognition, we may need to simply separate 
circles from squares. That is to label them into two different classes. In other 
applications, we have to answer a yes-no question, which is a binary classification. 

If there are n different classes, we can in principle first classify them into two 
classes: class, say k, and non-class k. We then focus on the non-class k and divide 
them into two different classes, and so on and so forth. 

Mathematically speaking, for a given (but scattered) data set, the objective 
is to separate them into different regions/domains or types. In the simplest case, 
the outputs are just class either A or B; that is, either +1 or —1. 


13.2.2 Statistical Learning Theory 


For the case of two-class classifications, we have the learning examples or data 
as (%i, yi) where 7 = 1,2,...,n and y; € {—1,+1}. The aim of the learning is to 
find a function fa(#) from allowable functions {fa : a € Q} such that 


fol@i)> yi,  (t=1,2,...,7n), (13.21) 


and that the expected risk E(a) is minimal. That is the minimization of the risk 


= 5 fat Ah —wlar tea); (13.22) 


where P(a, y) is an unknown probability distribution, which makes it impossible 
to calculate E(a) directly. A simple approach is to use the so-called empirical 


risk a 
B,(a) & = S~ >| falas) — yi. 
P n 2 


i=l 





(13.23) 


in drawback of this approach is that a small risk or error on the training 
es not necessarily guarantee a small error on prediction if the number n of 
ing data is small. 

1 the framework of structural risk minimization and statistical learning the- 
i, there exists an upper bound for such errors. For a given probability of at 
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Figure 13.4: Hyperplane, maximum margins and linear support 
vector machine (SVM). 


least 1 — p, the Vapnik bound for the errors can be written as 








E(a) < Ep(a) + o(2, pee, (13.24) 
where 
(2, tea?) = ‘E [h(log = + 1) —log(2)]. (13.25) 


Here h is a parameter, often referred to as the Vapnik-Chervonenskis dimension 
(or simply VC-dimension). This dimension describes the capacity for prediction 
of the function set f.. In the simplest binary classification with only two values 
of +1 and —1, fh is essentially the maximum number of points which can be 
classified into two distinct classes in all possible 2” combinations. 


13.2.3. Linear Support Vector Machine 


The basic idea of classification is to try to separate different samples into different 
classes. For binary classifications such as the triangles and spheres (or solid dots) 
as shown in Fig. 13.4, we intend to construct a hyperplane 


w:-e2+b=0, (13.26) 


so that these samples can be divided into classes with triangles on one side 
and the spheres on the other side. Here the normal vector w and 6 have the 
same size as x, and they can be determined using the data, though the method 






éher methods. 

SIn essence, if we can construct such a hyperplane, we should construct two 
yperplanes (shown as dashed lines) so that the two hyperplanes should be as 
oar away as possible and no samples should be between these two planes. Math- 
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ematically, this is equivalent to two equations 
w:-e2+b=41, (13.27 


and 





w:-x2t+b=-l. (13.28 


From these two equations, it is straightforward to verify that the normal (per- 
pendicular) distance between these two hyperplanes is related to the norm ||w]| 
via 


_ 2 
|||] 
A main objective of constructing these two hyperplanes is to maximize the dis- 
tance or the margin between the two planes. The maximization of d is equivalent 
to the minimization of ||w|| or more conveniently ||w||?. From the optimization 
point of view, the maximization of margins can be written as 





(13.29 


(w-w). (13.30) 


minimize 1 awl ! 
Ze = =< 
2 2 


If we can classify all the samples completely, for any sample (x;, y;) where 
i= 1,2,...,n, we have 
w-2,+b>+4+1, if (a, yi) € one class, (13.31) 


and 
w:a2,+b<-1, if (wi, y:) € the other class. (13.32) 


As y; € {+1,—-1}, the above two equations can be combined as 
yi(w-a;, +6) >1, (i =1,2,...,n). (13.33) 


However, in reality, it is not always possible to construct such a separating hy- 
perplane. A very useful approach is to use non-negative slack variables 


m > 0, (@= 1) 2307), (13.34) 
so that 
yi(w-a, +b) >1—m, G@=1,2,.5,7). (13.35) 
Now the optimization for the support vector machine becomes 
ee 1 2 . 
minimize UV = gull + AZ (13.36) 
subject to yi(w- a; +b) >1—m, (13.37) 
m > 0, (@ = 1,2,...,n), (13.38) 


‘here > 0 is a parameter to be chosen appropriately. Here, the term a7 a Ni 
ntially a measure of the upper bound of the number of misclassifications 
training data. 
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By using Lagrange multipliers a; > 0, we can rewrite the above constrained 
optimization into an unconstrained version, and we have 


1 n n 
L= Sle’ +A) m — > ailyi(w- a; +b) —(-n)]. (13.39) 
t=1 i=1 


From this, we can write the Karush-Kuhn-Tucker conditions 


aL ~ 
ap UO S¢ aiyiwi = 0, (13.40) 

i=1 

aL - 
ae 2. avy: = 0, (13.41) 
yi(w-a;,+b)—-(1—m) > 0, (13.42) 
ailyi(w - a2; +b) — (1—m)] =0, (¢ =1,2,...,n), (13.43) 
a>0, m>0, (¢=1,2,...,n). (13.44) 


From the first KKT condition, we get 
w= Yo aie (13.45) 
i=1 


It is worth pointing out here that only the nonzero coefficients a; contribute to 
the overall solution. This comes from the KKT condition (13.43), which implies 
that when a; 4 0, the inequality (13.37) must be satisfied exactly, while ao = 0 
means the inequality is automatically met. In this latter case, 7; = 0. Therefore, 
only the corresponding training data (a;,y:) with a; > 0 can contribute to the 
solution, and thus such x; form the support vectors (hence, the name support 
vector machine). All the other data with a; = 0 become irrelevant. 

V. Vapnik and B. Schélkopf et al. have shown that the solution for a; can be 
found by solving the following quadratic programming 


n 1 n 
maximize S- M5 Y Oi; Yyiy;(Li- 25), (13.46) 
i=l i,j=l 
subject to 
So avi =0, O<a<dA, (i=1,2,...,n). (13.47) 
i=1 
From the coefficients a;, we can write the final classification or decision function 
as , 
f(x) = sen[ )> aiyi(@ vi) + b}. (13.48) 
i=l 
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a) input space b) feature space. 
(a) input sp ( p 


Figure 13.5: Kernel functions and nonlinear transformation. 


13.2.4 Kernel Functions and Nonlinear SVM 


In reality, most problems are nonlinear, and the above linear SVM cannot be 
used. Ideally, we should find some nonlinear transformation ¢ so that the data 
can be mapped onto a high-dimensional space where the classification becomes 
linear (see Fig. 13.5). The transformation should be chosen in a certain way so 
that their dot product leads to a kernel-style function K(#,xi) = $(x) - d(a:), 
which enables us to write our decision function as 


f(x) = sen[ S > aiyiK (a, ai) +}. (13.49) 


vie 


From the theory of eigenfunctions, we know that it is possible to expand functions 
in terms of eigenfunctions. In fact, we do not need to know such transformations, 
we can directly use kernel functions K(x, «;) to complete this task. This is the 
so-called kernel function trick. Now the main task is to chose a suitable kernel 
function for a given problem. 

For most problems concerning a nonlinear support vector machine, we can use 
K(a,@i) = (a - a;)¢ for polynomial classifiers, K(a,#i) = tanh[k(a# - 2:) + ©)] 
for neural networks. The most widely used kernel is the Gaussian radial basis 
function (RBF) 


K(w, #1) = exp[-||a — @i||?/(20”)] = exp[—ylla — @i||7), (13.50) 


for nonlinear classifiers. This kernel can easily be extended to any high dimen- 
sions. Here o? is the variance and y = 1/20” is a constant. 

Following a similar procedure as discussed earlier for linear SVM, we can 
obtain the coefficients a; by solving the following optimization problem 


en < 1 
maximize So ay — 5 tia yiys K (wi, w5).- (13.51) 


i=l 





s orth pointing out under Mercer’s conditions for kernel functions, the matrix 
yiy;K(a;,@;) is a symmetric positive definite matrix, which implies that 


126 CHAPTER 13. ANNS AND SUPPORT VECTOR MACHINE 


the above maximization is a quadratic programming problem, and can thus be 
solved efficiently by many standard QP techniques. 


There are many software packages (commercial or open source) which are 


easily available, so we will not provide any discussion of the implementation. In 
addition, some methods and their variants are still an area of active research. 


Int 


erested readers can refer to more advanced literature. 
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Chapter 14 


METAHEURISTICS — A UNIFIED APPROACH 





We have introduced many nature-inspired metaheuristic algorithms in this book. 
Some algorithms have strong similarities, while others may be directly based on 
or inspired by some core algorithms. Now a natural question is what possible 
links among these algorithms could be, and in what way? This chapter concludes 
this book with an intention to unify the metaheuristics. 


14.1. INTENSIFICATION AND DIVERSIFICATION 


The efficiency of metaheuristic algorithms can be attributed to the fact that 
they imitate the best features in nature, especially the selection of the fittest in 
biological systems which have evolved by natural selection over millions of years. 

Two important characteristics of metaheuristics are: intensification and diver- 
sification (Blum and Roli 2003). Intensification intends to search locally and more 
intensively, while diversification makes sure the algorithm explores the search 
space globally (hopefully also efficiently). 

Furthermore, intensification is also called exploitation, as it typically searches 
around the current best solutions and selects the best candidates or solutions. 
Similarly, diversification is also called exploration, as it strives to explore the 
search space more efficiently, often by large-scale randomization. 

The fine balance between these two components is very important to the 
overall efficiency and performance of an algorithm. Too little exploration and 
too much exploitation could cause the system to be trapped in local optima, 
which makes it very difficult or even impossible to find the global optimum. 
On the other hand, if too much exploration but too little exploitation, it may 
be difficult for the system to converge and thus slows down the overall search 
performance. The proper balance itself is an optimization problem, and one of 
the main tasks of designing new algorithms is to find a certain balance concerning 
this optimality and/or tradeoff. 

metihen,. Furthermore, just exploitation and exploration are not enough. During the 
& “atch, we have to use a proper mechanism or criterion to select the best solutions. 
“most common criterion is to use the Survival of the Fittest, that is to keep 
ting the current best found so far. In addition, certain elitism is often used, 
Isthis is to ensure the best or fittest solutions are not lost, and should be 
a8sed onto the next generations. 
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14.2. WAYS FOR INTENSIFICATION AND DIVERSIFICATION 


There are many ways of carrying out intensification and diversification. In fact, 
each algorithm and its variants use different ways of achieving the balance of 
between exploration and exploitation. 

By analyzing all the metaheuristic algorithms, we can categorically say that 
the way to achieve exploration or diversification is mainly by certain randomiza- 
tion in combination with a deterministic procedure. This ensures that the newly 
generated solutions distribute as diversely as possible in the feasible search space. 
One of simplest and yet most commonly used randomization techniques is to use 


@new = D+ (U a L) * Eu, (14.1) 


where LE and U are the lower bound and upper bound, respectively. €,, is a uni- 
formly distributed random variable in [0,1]. This is often used in many algorithms 
such as harmony search, particle swarm optimization and firefly algorithm. 

It is worth pointing that the use of a uniform distribution is not the only 
way to achieve randomization. In fact, random walks such as Lévy flights on a 
global scale are more efficient. We can use the same equation (14.2) to carry out 
randomization, and the only difference is to use a large step size s so that the 
random walks can cover a larger region on the global large scale. 

A more elaborate way to obtain diversification is to use mutation and crossover. 
Mutation makes sure new solutions are as far/different as possible, from their par- 
ents or existing solutions; while crossover limits the degree of over diversification, 
as new solutions are generated by swapping parts of the existing solutions. 

The main way to achieve the exploitation is to generate new solutions around 
a promising or better solution locally and more intensively. This can be easily 
achieved by a local random walk 


Lnew = Lola + § WwW, (14.2) 


where w is typically drawn from a Gaussian distribution with zero mean. Here 
s is the step size of the random walk. In general, the step size should be small 
enough so that only local neighbourhood is visited. If s is too large, the region 
visited can be too far away from the region of interest, which will increase the 
diversification significantly but reduce the intensification greatly. Therefore, a 
proper step size should be much smaller than (and be linked with) the scale of 
the problem. For example, the pitch adjustment in harmony search and the move 
in simulated annealing are a random walk. 

If we want to increase the efficiency of this random walk (and thus increase 
the efficiency of exploration as well), we can use other forms of random walks 
such as Lévy flights where s is drawn from a Lévy distribution with large step 
sizes. In fact, any distribution with a long tail will help to increase the step size 
and distance of such random walks. 

Even with the standard random walk, we can use a more selective or con- 


Lnew = Lbest + SW. (14.3) 
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Some intensification technique is not easy to decode, but may be equally 
effective. The crossover operator in evolutionary algorithms is a good example, 
as it uses the solutions/strings from parents to form offsprings or new solutions. 

In many algorithms, there is no clear distinction or explicit differentiation 
between intensification and diversification. These two steps are often intertwined 
and interactive, which may, in some cases, become an advantage. Good examples 
of such interaction is the genetic algorithms, harmony search and bat algorithm. 
Readers can analyze any chosen algorithm to see how these components are 
implemented. 

In addition, the selection of the best solutions is a crucial component for the 
success of an algorithm. Simple, blind exploration and exploitation may not 
be effective without the proper selection of the solutions of good quality. Simply 
choosing the best may be effective for optimization problems with a unique global 
optimum. Elitism and keeping the best solutions are efficient for multimodal 
and multi-objective problems. Elitism in genetic algorithms and selection of 
harmonics are good examples of the selection of the fittest. 

In contrast with the selection of the best solutions, an efficient metaheuristic 
algorithm should have a way to discard the worse solutions so as to increase the 
overall quality of the populations during evolution. This is often achieved by 
some form of randomization and probabilistic selection criteria. For example, 
mutation in genetic algorithms acts a way to do this. Similarly, in the cuckoo 
search discussed earlier, the castaway of a nest /solution is another good example. 

Another important issue is the randomness reduction. Randomization is 
mainly used to explore the search space diversely on the global scale, and also, to 
some extent, the exploitation on a local scale. As better solutions are found, and 
as the system converges, the degree of randomness should be reduced; otherwise, 
it will slow down the convergence. For example, in particle swarm optimization, 
randomness is automatically reduced as the particles swarm together, this is be- 
cause the distance between each particle and the current global best is becoming 
smaller and smaller. Similarly, randomness is effectively reduced in differential 
evolution 

Lnew = E+ F(x; =— x5), (14.4) 


whose last term is decreasing as the difference vector gets smaller and smaller. 

In other algorithms, randomness is not reduced and but controlled and se- 
lected. For example, the mutation rate is usually small so as to limit the ran- 
domness, while in simulated annealing, the randomness during iterations may 
remain the same, but the solutions or moves are selected and acceptance proba- 
bility becomes smaller. 

Finally, from the implementation point of view, the actual implementation 
does vary, even though the pseudo code should give a good guide and should not in 
principle lead to ambiguity. However, in practice, the actual way of implementing 
the algorithm does affect the performance to some degree. Therefore, validation 
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14.3 GENERALIZED EVOLUTIONARY WALK ALGORITHM 
(GEWA) 


From the above discussion of all the major components and their characteris- 
tics, we realized that a good combination of local search and global search with 
a proper selection mechanism should produce a good metaheuristic algorithm, 
whatever the name it may be called. 

In principle, the global search should be carried out more frequently at the 
initial stage of the search or iterations. Once a number of good quality solutions 
are found, exploration should be sparse on the global scale, but frequent enough 
so as to escape any local trap if necessary. On the other hand, the local search 
should be carried out as efficient as possible, so a good local search method should 
be used. The proper balance of these two is paramount. 

Using these basic components, we can now design a generic, metaheuristic al- 
gorithm for optimization, we can call it the Generalized Evolutional Walk Algo- 
rithm (GEWA). Evolutionary walk is a random walk, but with a biased selection 
towards optimality. This is a generalized framework for global optimization. 

There are three major components in this algorithm: 1) global exploration by 
randomization, 2) intensive local search by random walk, and 3) the selection of 
the best with some elitism. The pseudo code of GEWA is shown in Fig. 12.1. 
The random walk should be carried out around the current global best g, so as 
to exploit the system information such as the current best more effectively. We 
have 

Tt41=—9, +U, (14.5) 


and 
w =ed, (14.6) 


where ¢ is drawn from a Gaussian distribution or normal distribution N(0, 07), 
and d is the step length vector which should be related to the actual scales of 
independent variables. For simplicity, we can take o = 1. 

The randomization step can be achieved by 


ti41=L+(U—-L)eu, (14.9) 


where €, is drawn from a uniform distribution Unif[0,1]. U and L are the upper 
and lower bound vectors, respectively. 

Typically, a = 0.25 ~ 0.7. We will use a = 0.5 in our implementation. 
Interested readers can try to do some parametric studies. 

Again two important issues are: 1) the balance of intensification and diversi- 
fication controlled by a single parameter a, and 2) the choice of the step size of 
the random walk. Parameter a is typically in the range of 0.25 to 0.7. The choice 
of the right step size is also important, as discussed in Section 4.3. The ratio of 
the step size to its length scale can be determined by (4.17), which is typically 

jMetaher,,, 0-001 to 0.01 for most applications. 


a Another important issue is the selection of the best and/or elitism. As we 


tend to discard the worst solution and replace it by generating new solution. 

Bis may implicitly weed out the least-fit solutions, while the solution with the 

ighest fitness remains in the population. The selection of the best and elitism 
‘S 7 ree . 

yan be guaranteed implicitly in the evolutionary walkers. 
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Initialize a population of n walkers x; (¢ = 1,2,...,7); 
Evaluate fitness F; of n walkers & find the current best g,; 
while (¢ <MaxGeneration) or (stop criterion); 
Discard the worst solution and replace it by (14.7) or (14.8); 
if (rand < a), 
Local search: random walk around the best 


T1441 = 9, + ed (14.7) 


else 
Global search: randomization 


Lt41 = L + (U _ Lie (14.8) 
end 
Evaluate new solutions and find the current best g‘; 
t=t+1; 
end while 


Postprocess results and visualization; 





Figure 14.1: Generalized Evolutionary Walk Algorithm (GEWA). 


Furthermore, the number (n) of random walkers is also important. Too few 
walkers are not efficient, while too many may lead to slow convergence. In general, 
the choice of n should follow the similar guidelines as those for all population- 
based algorithms. Typically, we can use n = 15 to 50 for most applications. 


% GEWA (Generalized evolutionary walker algorithm) h 
% by Xin-She Yang @ Cambridge University 2007-2009 h 
% Three major components in GEWA: h 
% 1) random walk 2)randomization 3) selection/elitism 7% 
fp SRR SARS Seana SoS SaaS nee Saaremaa aaa eras hh 
% Two algorithm-dependent parameters: h 

h n=population size or number of random walkers 4% 
h pa=randomization probability h 
hora SS SR a aS conc aSeeeeos hh 


function [bestsol,fval]=gewa(N_iter) 
% Default number of iterations 
gs Metal, menetestA N_iter=5000; end 
ension or number variables 
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% population size -- the number of walkers 
n=10; 


% Probability -- balance of local & global search 
alpha=0.5; 


% Random initial solutions 
ns=init_sol(n,Lb,Ub); 

% Evaluate all new solutions and find the best 
fval=init_fval (ns) ; 

[fbest ,sbest ,kbest]=findbest (ns, fval) ; 


% Iterations begin 
for j=1:N_iter, 


% Discard the worst and replace it later 
k=get_fmax(fval) ; 


if rand<alpha, 
% Local search by random walk 
ns(k,:)=rand_walk(sbest,Lb,Ub) ; 


else 

% Global search by randomization 
ns(k,:)=randomization(Lb, Ub) ; 

end 


fy Evaluation and selection of the best 
fval (k)=fobj(ns(k,:)); 
if fval(k)<fbest, 
fbest=fval (k) ; 
sbest=ns(k,:); 
kbest=k; 
end 
end % end of iterations 


% Post-processing and show all the solutions 
ns 

4, Show the best and number of evaluations 
Best_solution=sbest 

Best_fmin=fbest 

Number_Eval=N_iter+n 







os All subfunctions are placed here ---------------- 
#zInitial solutions 
S 


nction ns=init_sol(n,Lb,Ub) ; 
@r i=i:n, 
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end 


% Perform random walks around the best 
function s=rand_walk(sbest,Lb,Ub) 
step=0.01; 
s=sbesttrandn(size(sbest)).*(Ub-Lb) .*step; 


% Discard the worst solution and replace it later 
function k=get_fmax(fval) 
[fmax ,k]=max(fval) ; 


% Randomization in the whole search space 
function s=randomization (Lb, Ub) 
d=length (Lb) ; 

s=Lb+(Ub-Lb) . *rand(1,d) ; 


y/ Evaluations of all initial solutions 
function [fval]=init_fval (ns) 
n=size(ns,1); 
for k=i:n, 

fval(k)=fobj(ns(k,:)); 
end 


/, Find the best solution so far 
function [fbest,sbest,kbest]=findbest (ns,fval) 
n=size(ns,1); 
fbest=fval (1); 
sbest=ns(1,:); kbest=1; 
for k=2:n, 
if fval(k)<=fbest, 
fbest=fval (k); 
sbest=ns(k,:); 
kbest=k; 
end 
end 


% Objective function 

function z=fobj(u) 

% Rosenbrock’s 3D function 

z=(1-u (1) )*2+100* (u(2)-u(1) *2) °2+ (1-0 (3) ) °2; 
howrtttan end of the GEWA implementation ------- 






= EAGLE STRATEGY 


we discussed earlier, a fine balance of global exploration and local exploitation 
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Eagle Strategy 


Objective functions f(a), ..., fn(x) 
Initialization and random initial guess x 
while (stop criterion) 
Global exploration by randomization (e.g. Lévy flights) 
Evaluate the objectives and find a promising solution 
Intensive local search around a promising solution 

via an efficient local optimizer (e.g. hill-climbing) 

if (a better solution is found) 

Update the current best 

end 
Update t =t+1 
end 
Post-process the results and visualization. 





t=0 





Figure 14.2: Pseudo code of a two-stage eagle strategy for optimization. 


algorithms for exploration and exploitation. In fact, a proper combination of 
different algorithms can be used for different purposes. Eagle Strategy, developed 
by Xin-She Yang and Suash Deb in 2010, is such an attempt, as inspired by the 
foraging behaviour of an eagle. 
In essence, eagle strategy is a two-stage strategy. Firstly, we observe that an 
eagle searches for a pray by performing Lévy flights in the whole search domain. 
Once it finds a prey (a good solution), it changes to a chase strategy which 
focuses on the efficient capture of the prey (or solution). From the optimization 
point of view, this means that a global search is first performed by an efficient 
randomization technique such as Lévy flights, then an efficient local optimizer 
such as the steepest descent or downhill simplex methods can be used so as to 
find the local optimum quickly. To avoid being trapped locally, a global search 
is again performed, and followed by a local search. This proceeds iteratively in 
the same manner. This strategy can be summarized as the pseudo code shown 
in Fig. 14.2. 
It is worth pointing that this is a methodology or strategy, not an algorithm. 
In fact, we can use different algorithms at different stages and at different time of 
the iterations. The algorithm used for the global exploration should have enough 
randomness so as to explore the search space diversely and effectively. This 
process is typically slow initially, and should speed up as the system converges 
(or no better solutions can be found after a certain number of iterations). On 
the other hand, the algorithm used for the intensive local exploitation should be 
gh Metthear an efficient local optimizer such as the Nelder-Mead downhill simplex method or 
ay ‘the simplest hill-climbing. The idea is to reach the local optimality as quickly as 
: assible, with the minimal number of function evaluations. This stage should be 
st and efficient. 
aA good combination of the algorithms may be selected for a given problem, 
v Rnd this itself is an optimization problem. Ultimately, if an optimization system 
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can be built in such a way that the algorithms can be selected automatically and 
evolve accordingly, then intelligent algorithms can be developed to solve complex 
optimization problems efficiently. 


14.5 OTHER METAHEURISTIC ALGORITHMS 


There are a few other nature-inspired algorithms that are used in the literature. 
Some such as Tabu search are widely used, while others are gaining momentum. 
For example, both the photosynthetic algorithm and the enzyme algorithm are 
very specialized algorithms. In this chapter, we will briefly outline the basic con- 
cepts of these algorithms without implementation. Readers who are interested in 
these algorithms can refer to recent research journals and conference proceedings 
for more details. 


14.5.1 Tabu Search 


Tabu search, developed by Fred Glover in 1970s, is the search strategy that uses 
memory and search history as a major component. As most successful search 
algorithms such as gradient-based methods do not use memory, it seems that at 
first it is difficult to see how memory will improve the search efficiency. Thus, 
the use of memory poses subtleties that may not be immediately visible on the 
surface. This is because memory could introduce too many degrees of freedom, 
especially for the adaptive memory use, which makes it almost impossible to 
use rigorous mathematical analysis to establish the convergence and efficiency of 
such algorithms. Therefore, even though Tabu search works so well for certain 
problems, it is difficult to analyze mathematically. Consequently, Tabu search 
remains a heuristic approach. Two important issues that are still under active 
research are how to use the memory effectively and how to combine with other 
algorithms to create more superior new-generation algorithms. 

Tabu search is an intensive local search algorithm and the use of memory 
avoids the potential recycling of local solutions so as to increase its search effi- 
ciency. The recently tried or visited solutions are recorded and put into a Tabu 
list, and new solutions should avoid those in the Tabu list. Over a large number 
of iterations, this Tabu list could save tremendous amount of computing time 
and thus increase the search efficiency significantly. 

Tabu search was originally developed together with the integer programming, 
a special class of linear programming. The Tabu list in combination with inte- 
ger programming can reduce computing time by at least two orders for a given 
problem, comparing the solely standard integer programming. Tabu search can 
also be used along with many other algorithms. 


44.5.2 Photosynthetic and Enzyme Algorithm 







sfsynthetic algorithm was first developed by H. Murase in 2000 for finite el- 
inverse analysis of parameter estimations in agriculture engineering. This 
‘ithm was based on the photosynthesis in green plants. Photosynthesis uses 
Yer and carbon dioxide to produce glucose and oxygen in the presence of chloro- 
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plasts and light. The actual reaction is quite complicated, though it is often 
simplified as the following overall reaction 


light 


6CO2 + 6H20 — CegHi120¢6 +602. (14.10) 


In addition to light intensity and available chloroplasts, other factors controlling 
this reaction are temperature, concentration of CO2, and water content. With 
sufficient water and COzg, the overall reaction efficiency is largely determined by 
light intensity. 

In the photosynthetic algorithm, the product DHAP serves as the knowledge 
strings of the algorithm, and optimization is reached when the quality (or the 
fitness) of the product no longer improves. 

The enzyme algorithm (EA) was developed by Xin-She Yang at Cambridge 
University in 2005. It is based on the fundamental mechanism of enzyme reac- 
tions. The Michaelis-Menton quasi-steady state hypothesis of enzyme kinetics 
assumes the rapid, reversible formation of a complex between an enzyme and its 
substrate (S). The rate or velocity v is often governed by the classic Michaelis- 


Menton theory 


Vinax Vina 
= _ 14.11 
See, Ikke Cay 


where Vmax is the maximum velocity, and Km is the Michaelis constant. This 
relationship is very similar to that used in the photosynthetic algorithm. The aim 
is now to optimize the product (P), similar to the DHAP in the PA. The inhibition 
and cooperativity, in combination with forward and backward reactions, act as 
an interacting buffer so as to maximize the amount of the final product. It has 
been applied to solve optimization problems in engineering applications. 





14.5.3 Artificial Immune System and Others 


There are many other algorithms we have not covered in this book. One of the 
most important algorithms is the so-called artificial immune system, inspired by 
the characteristics of the immune system of mammals to use memory and learning 
as a novel approach to problem solving. The idea was proposed by Farmer et 
al. in 1986, with important work on immune networks by Bersini and Varela 
in 1990. It is an adaptive system with high potential. There are many variants 
developed over the last two decades, including the clonal selection algorithm, 
negative selection algorithm, immune networks and others. 

The memetic algorithm, proposed by P. Moscato in 1989, is a multi-generation, 
co-evolution and self-generation algorithm, and it can be considered as a hyper- 
heuristic algorithm, rather than metaheuristic. 

Another popular method is the cross-entropy method developed by Rubinstein 
in 1997. It is a generalized Monte Carlo method, based on the rare event simula- 

gh Metehctns tions. This algorithm consists of two phases: generation of random samples and 
x “apdate of the parameters. Here the aim is to minimize the c cross entropy. 
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Obviously, more and more metaheuristic algorithms will appear in the future. 
Interested readers can follow the latest literature and research journals. 


14.6 FURTHER RESEARCH 


14.6.1 Open Problems 


Despite the success of modern metaheuristic algorithms, there are many im- 
portant questions which remain unanswered. We know how these heuristic algo- 
rithms work, and we also partly understand why these algorithms work. However, 
it is difficult to analyze mathematically why these algorithms are so successful. 
In fact, these are unresolved open problems. 

Apart from the mathematical analysis on simulated annealing and particle 
swarm optimization, convergence of all other algorithms has not been proved 
mathematically, at least up to now. Any mathematical analysis will thus provide 
important insight into these algorithms. It will also be valuable for providing 
new directions for further important modifications on these algorithms or even 
pointing out innovative ways of developing new algorithms. 

In addition, it is still only partly understood why different components of 
heuristics and metaheuristics interact in a coherent and balanced way so that 
they produce efficient algorithms which converge under the given conditions. For 
example, why does a balanced combination of randomization and a deterministic 
component lead to a much more efficient algorithm (than a purely deterministic 
and/or a purely random algorithm)? How to measure or test if a balance is 
reached? How to prove that the use of memory can significantly increase the 
search efficiency of an algorithm? Under what conditions? 

From the well-known No-Free-Lunch theorems, we know that they have been 
proved for single objective optimization, but they remain unproved for multiob- 
jective optimization. If they are proved to be true (or not) for multiobjective 
optimization, what are the implications for algorithm development? 

If you are looking for some research topics, either for yourself or for your 
research students, these could form important topics for further research. 


14.6.2 To be Inspired or not to be Inspired 


We have seen in this book that nature-inspired algorithms are always based on a 

particular (often most successful) mechanism of the natural world. For example, 

bee algorithms are based on the optimal solution of foraging and storing the 
maximum amount of nectar. 

Nature has evolved over billions of years, she has found almost perfect solutions 

entered, every problem she has met. Almost all the not-so-good solutions have been 

x is séarded via natural selection. The optimal solutions seem (often after a huge 

iRer of generations) to appear at the evolutionarilly stable equilibrium, even 

gh we may not understand how the perfect solutions are reached. When we 

70 solve human problems, why not try to be inspired by the nature’s success? 

% simple answer to the question ‘To be inspired or not to be inspired?’ is ‘why 
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not?’. If we do not have good solutions at hand, it is always a good idea to learn 
from nature. 

Another important question is what algorithm to choose for a given problem? 
This depends on many factors such as the type of problem, the solution qual- 
ity, available computing resource, time limit (before which a problem must be 
solved), balance of advantages and disadvantages of each algorithm (another op- 
timization problem!), and the expertise of the decision-makers. When we study 
the algorithms, the efficiency and advantages as well as their disadvantages, to a 
large extent, essentially determine the type of problem they can solve and their 
potential applications. In general, for analytical function optimization problems, 
nature-inspired algorithms should not be the first choice if analytical methods 
work well. If the function is simple, we can use the stationary conditions (first 
derivatives must be zero) and extreme points (boundaries) to find the optimal 
solution(s). If this is not the best choice for a given problem, then calculus-based 
algorithms such as the steepest descent method should be tried. If these options 
fail, then nature-inspired algorithms can be used. On the other hand, for large- 
scale, nonlinear, global optimization problems, modern approaches tend to use 
metaheuristic algorithms (unless there is a particular algorithm already worked 
so well for the problem of interest). 

Now the question is why almost all the examples about numerical algorithms 
in this book (and in other books as well) are discussed using analytical functions? 
This is mainly for the purpose of validating new algorithms. The standard test 
functions such as Rosenbrock’s banana function and De Jong’s functions are 
becoming standard tests for comparing new algorithms against established al- 
gorithms because the latter have been well validated using these test functions. 
This will provide a good standard for comparison. 

Another important question is how to develop new algorithms. There are 
many ways of achieving a good formulation of new algorithms. Two good and 
successful ways are based on two basic ways of natural selection: explore new 
strategies and inherit the fittest strategies. Therefore, the first way of developing 
new algorithms is to design new algorithms using new discoveries. The second 
way is to formulate by hybrid and crossover. 

Another successful strategy used by nature is the adaptation to its new envi- 
ronment. We can also use the same strategy to modify existing algorithms and ex- 
plore their new applications. For any new optimization problems we might meet, 
we can try to modify existing successful algorithms, to suit for new applications, 
by either changing the controlling parameters or introducing new functionality. 
Many variants of numerical algorithms have been developed this way. 

In order to develop completely new nature-inspired algorithms, we have to 
observe, study and learn from nature. For example, I was always fascinated by 
the cobwebs spun by spiders. For a given environment, how does a spider decide 
to spin a web with such regularity so as to maximize the chance of catching some 
food? Are the cobweb’s location and pattern determined by the airflow? Surely, 







2Nature provides almost unlimited ways for problem-solving. If we can observe 
efully, we are surely inspired to develop more powerful and efficient new gen- 
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Ultimately some intelligent algorithms (or systems) may appear in the future, so 
that they can evolve and optimally adapt to solve NP-hard optimization problems 
efficiently and intelligently. 
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